Scipy Kmeans and Kmeans2 and Sklearn KMeans - cluster-analysis

I have a big matrix with dimensions 2,000 x 98,000 and I want to perform unsupervised clustering on it. My problem is that when I try clustering with scipy.cluster.vq.kmeans2, I get an error saying "Matrix is not positive definite"; when I try it with scipy.cluster.vq.kmeans, it takes hours and hours to compute; and when I try it with sklearn.cluster.KMeans, the computation is fast and raises no errors. I have read the documentation for all three algorithms and searched the internet for answers, but I still cannot understand this difference between them. Could someone explain the fundamental difference between the three, and why a positive definite matrix is required only by scipy's kmeans2? Thank you in advance for your time and consideration.

Related

How to reduce matrix dimension using PCA in matlab? [duplicate]

This question already has answers here:
Matlab - PCA analysis and reconstruction of multi dimensional data
(2 answers)
I want to reduce a large matrix, e.g. 2000*768, to some lower dimension, e.g. 200*768 or 400*400 (not fixed), using principal component analysis (PCA) in MATLAB. I want to do this for feature dimension reduction. How can I do it easily? Please also suggest some tutorials for understanding PCA better.
Thanks in advance.
PCA is a really useful tool for dimensionality reduction, but it should be used when you understand exactly what it is doing and what you are getting out of it. For a good intro click here - it is a decent explanation which is not too hard to follow. There is also this article which is a quick DIY walkthrough which may help you understand better what is going on.
Once you know what you are getting, PCA is easy in MATLAB: just call pca(X) to run it on a data set X.
What you get out is very much dependent on what you put in (e.g. things like normalisation are very important for input data), and there are extra parameters worth knowing about when setting up your principal component analysis. See MATLAB's guide here.
What you are looking for in dimensionality reduction is to best represent the data with as few components as possible. Using the explained output of [coeff,score,latent,tsquared,explained] = pca(X), you get a vector telling you how much of the variance is explained by each principal component, which gives you a good indication of whether dimensionality reduction can be done.
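As a rough sketch of how that output can be used (assuming X is the 2000-by-768 data matrix with observations in rows, and treating 95% as an arbitrary variance threshold), selecting the number of components could look like this:

[coeff, score, latent, tsquared, explained] = pca(X);   % X: 2000-by-768, rows = observations
k = find(cumsum(explained) >= 95, 1);    % smallest k explaining 95% of the variance
Xreduced = score(:, 1:k);                % 2000-by-k reduced representation

Here score(:, 1:k) is the data projected onto the first k principal components; pick the threshold to suit your application rather than taking 95% as a rule.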

Naive bayes classifier calculation

I'm trying to use a naive Bayes classifier to classify my dataset. My questions are:
1- Usually, when we try to calculate the likelihood, we use the formula:
P(c|x) = P(c|x1) * P(c|x2) * ... * P(c|xn) * P(c). But some examples say that, in order to avoid getting very small results, we should use P(c|x) = exp(log(c|x1) + log(c|x2) + ... + log(c|xn) + logP(c)). Can anyone explain the difference between these two formulas, and are they both used to calculate the "likelihood", or is the second one used to calculate something called "information gain"?
2- In some cases, when we try to classify our datasets, some joint probabilities are null. Some people use the "Laplace smoothing" technique to avoid these null joints. Doesn't this technique influence the accuracy of our classification?
Thanks in advance for your time. I'm new to this algorithm and trying to learn more about it. Are there any recommended papers I should read? Thanks a lot.
I'll take a stab at your first question, assuming you lost most of the P's in your second equation. I think the equation you are ultimately driving towards is:
log P(c|x) = log P(c|x1) + log P(c|x2) + ... + log P(c)
If so, the examples are pointing out that in many statistical calculations, it's often easier to work with the logarithm of a distribution function, as opposed to the distribution function itself.
Practically speaking, it's related to the fact that many statistical distributions involve an exponential function. For example, you can find where the maximum of a Gaussian distribution K*exp(-s_0*(x-x_0)^2) occurs by solving the mathematically less complex problem (if we're going through the whole formal process of taking derivatives and finding equation roots) of finding where the maximum of its logarithm, log(K) - s_0*(x-x_0)^2, occurs.
This leads to many places where "take the logarithm of both sides" is a standard step in an optimization calculation.
Also, computationally, when you are optimizing likelihood functions that may involve many multiplicative terms, adding logarithms of small floating-point numbers is less likely to cause numerical problems than multiplying small floating point numbers together is.
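As a small, made-up illustration of that numerical point (the probabilities below are arbitrary, not from any real dataset):

p = [1e-5, 2e-7, 3e-6, 1e-8];              % made-up per-feature likelihoods P(x_i|c)
prior = 0.3;                               % made-up class prior P(c)
direct  = prod(p) * prior                  % direct product of small numbers
viaLogs = exp(sum(log(p)) + log(prior))    % same value computed through logs

With only four features both give the same answer, but as the number of features grows the direct product drifts toward zero and eventually underflows, whereas the sum of logs stays well-scaled. In practice you would compare the log-scores of the classes directly instead of exponentiating at the end.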

Optimization stops prematurely (MATLAB)

I'm trying my best to work it out with fmincon in MATLAB. When I call the function, I get one of the two following errors:
Number of function evaluations exceeded, or
Number of iterations exceeded.
And when I look at the solution found so far, it is far from the intended one (I know so because I constructed the minimum vector myself).
Even if I increase the tolerance constraints or the maximum number of iterations, I still get the same problem.
Any help is appreciated.
First, if your problem can actually be cast as linear or quadratic programming, do that first.
Otherwise, have you tried seeding it with different starting values x0? If it's starting in a bad place, it may be much harder to get to the optimum.
If it's possible for you to provide the gradient of the function, that can help the optimizer tremendously (though obviously only if you can find it some way other than numerical differentiation). Similarly, if you can provide the (full or sparse) Hessian relatively cheaply, you're golden.
You can also try using a different algorithm in the solver.
Basically, fmincon by default has almost no info about the function it's trying to optimize, and providing more can be extremely helpful. If you can tell us more about the objective function, we might be able to give more tips.
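For instance, supplying an analytic gradient might look like the sketch below. This is only an illustration: myObjective and the problem data are placeholders, and the SpecifyObjectiveGradient option assumes a reasonably recent MATLAB release.

% Placeholder problem data, just to make the sketch self-contained.
x0  = [1; 2; 3];
Aeq = [1 1 1];
beq = 1;
opts = optimoptions('fmincon', 'SpecifyObjectiveGradient', true);
x = fmincon(@myObjective, x0, [], [], Aeq, beq, [], [], [], opts);

function [f, g] = myObjective(x)
    f = sum(x.^2);   % placeholder objective value
    g = 2*x;         % its gradient, returned alongside the value
end

The same options call is where you would also switch algorithms, e.g. optimoptions('fmincon', 'Algorithm', 'sqp').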
The L1 norm is not differentiable. That can make it difficult for the algorithm to converge to a point where one of the residuals is zero, and I suspect this is why the iteration limits are exceeded. If your original problem is
min norm(residual(x),1)
s.t. Aeq*x=beq
you can reformulate the problem in a differentiable way, as follows:
min sum(b)
s.t. -b(i)<=residual(x,i)<=b(i)
Aeq*x=beq
where residual(x,i) is the i-th residual, x is the original vector of unknowns, and b is a further unknown vector of bounds that you add to the problem.
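If the residual happens to be affine, say residual(x) = A*x - d (an assumption, since the question does not give the actual form), the reformulated problem is a linear program and can be handed to linprog instead of fmincon. A sketch with placeholder data:

% Placeholder data: residual(x) = A*x - d, plus one equality constraint.
A   = [1 2; 3 4; 5 6];
d   = [1; 2; 3];
Aeq = [1 1];
beq = 1;

[m, n] = size(A);
f = [zeros(n, 1); ones(m, 1)];          % minimize sum(b) over z = [x; b]
Aineq = [ A, -eye(m);                   %   A*x - d <= b
         -A, -eye(m)];                  % -(A*x - d) <= b
bineq = [d; -d];
AeqZ  = [Aeq, zeros(size(Aeq, 1), m)];  % equality constraint acts on x only
z = linprog(f, Aineq, bineq, AeqZ, beq);
x = z(1:n);                             % original unknowns
b = z(n+1:end);                         % per-residual bounds; sum(b) is the L1 norm

If the residual is genuinely nonlinear, the same trick still works inside fmincon: append b to the decision vector and express the two-sided bounds as (nonlinear) inequality constraints.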

matlab: eigs appears to give out inconsistent results

I'm trying to get the eigenvectors corresponding to the two smallest eigenvalues of a matrix:
[v,c]=eigs(lap,2,'sm');
The result v is "correct" about 66% of the time. When I say correct I mean it "looks right" in terms of the problem I am trying to solve, of course.
The rest of the time I get different vectors.
I know eigs uses an iterative numerical solver and that its initial guess is random, so that explains the variation. What bothers me is that, according to MATLAB's documentation, the tolerance used as the stopping criterion is initially set to eps, and I tried increasing opts.maxit=10000000;, but it doesn't appear to affect the results or the run time, so I assume the tolerance is met before the maximum number of iterations is reached.
What can I do to get consistent results? There's no problem in terms of computation time.
Please note that the matrix is very large and sparse, so I cannot work with eig, only with eigs.
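One thing worth trying (a guess at the cause, not something the question confirms) is removing the randomness by fixing the starting vector and tightening the tolerance through the options struct:

% lap is assumed to be your sparse Laplacian; the tiny placeholder below exists
% only so the snippet runs on its own.
n = 1000;
lap = spdiags([-ones(n,1), 2*ones(n,1), -ones(n,1)], -1:1, n, n);

opts.tol   = 1e-10;            % tighter stopping tolerance
opts.maxit = 5000;
opts.v0    = ones(n, 1);       % fixed, non-random starting vector
[v, c] = eigs(lap, 2, 'sm', opts);

With v0 fixed, repeated runs become deterministic. Bear in mind, though, that if the two smallest eigenvalues are (nearly) equal, any basis of the corresponding eigenspace is an equally valid answer, and that ambiguity cannot be removed by options alone.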

How do I obtain the eigenvalues of a huge matrix (size: 2x10^5)

I have a matrix of size 200000 x 200000 and I need to find its eigenvalues. I was using MATLAB until now, but the matrix is too large for it to handle, so I switched to Perl, and now even Perl cannot handle this huge matrix: it reports being out of memory. I would like to know whether I can compute the eigenvalues of this matrix using some other programming language that can handle such large data. Most of the elements are non-zero, so going sparse is not an option. Please help me solve this.
I think you may still have luck with MATLAB. Take a look at their Distributed Computing Toolbox. You'd need some kind of parallel environment, such as a computing cluster.
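To see why a single machine gives up, here is a quick back-of-the-envelope check (dense double-precision storage assumed; the numbers are just arithmetic, not measured):

n = 200000;
bytes = n^2 * 8;              % 8 bytes per double-precision element
gigabytes = bytes / 2^30      % roughly 298 GiB for the dense matrix alone

That is before any workspace the eigensolver itself needs, which is why the out-of-memory failure happens in any language, not just MATLAB or Perl, and why a distributed setup is the realistic route.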
If you don't have a computational cluster, you might look into distributed eigenvalue/vector calculation methods that could be employed on Amazon EC2 or similar.
There is also a discussion of parallel eigenvalue calculation methods here, which may direct you to better libraries and programming approaches than Perl.