Naive bayes classifier calculation - classification

I'm trying to use a naive Bayes classifier to classify my dataset. My questions are:
1- Usually when we try to calculate the likelihood we use the formula:
P(c|x)= P(c|x1) * P(c|x2)*...P(c|xn)*P(c) . But in some examples it says that, in order to avoid getting very small results, we use P(c|x)= exp(log(c|x1) + log(c|x2)+...log(c|xn) + logP(c)). Can anyone explain the difference between these two formulas? Are they both used to calculate the "likelihood", or is the second one used to calculate something called "information gain"?
2- In some cases when we try to classify our datasets, some joint probabilities are zero. Some people use the "Laplace smoothing" technique to avoid zero joint probabilities. Doesn't this technique influence the accuracy of our classification?
Thanks in advance for all your time. I'm new to this algorithm and trying to learn more about it. Are there any recommended papers I should read? Thanks a lot.

I'll take a stab at your first question, assuming you lost most of the P's in your second equation. I think the equation you are ultimately driving towards is:
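log P(c|x) = log P(c|x1) + log P(c|x2) + ... + log P(c|xn) + log P(c)
(Strictly speaking, the naive Bayes factors are the class-conditional likelihoods P(xi|c) rather than P(c|xi), and the right-hand side equals log P(c|x) only up to an additive constant that is the same for every class. Neither point changes the argument that follows.)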
If so, the examples are pointing out that in many statistical calculations, it's often easier to work with the logarithm of a distribution function, as opposed to the distribution function itself.
Practically speaking, it's related to the fact that many statistical distributions involve an exponential function. For example, you can find where the maximum of a Gaussian distribution K*exp(-s_0*(x-x_0)^2) occurs by solving the mathematically simpler problem (if we're going through the whole formal process of taking derivatives and finding equation roots) of finding where the maximum of its logarithm log(K) - s_0*(x-x_0)^2 occurs.
This leads to many places where "take the logarithm of both sides" is a standard step in an optimization calculation.
Also, computationally, when you are optimizing likelihood functions that may involve many multiplicative terms, adding logarithms of small floating-point numbers is less likely to cause numerical problems than multiplying small floating point numbers together is.
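To make the numerical point concrete, here is a minimal sketch. The 100 factors of 1e-5 are made-up values chosen only to force underflow; they are not taken from the question.

```matlab
% Hypothetical example: 100 per-feature likelihood factors, each equal to 1e-5.
p = 1e-5 * ones(1, 100);

directProduct = prod(p);   % true value is 1e-500, far below the double-precision range
logScore      = sum(log(p));   % the same quantity on a log scale stays finite

fprintf('direct product: %g\n', directProduct);
fprintf('sum of logs:    %g\n', logScore);
```

Because the logarithm is monotonic, the class with the largest sum of logs is also the class with the largest product, so there is no need to exponentiate the result before comparing classes.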


Matlab Confidence Interval for Degrees of Freedom

I would like to calculate a Confidence Interval along with my Degrees of Freedom (DOF) estimation in Matlab. I am trying to run the following line of code:
[R, DoF, ciDOF] = copulafit('t', U); % fit the copula
The code line without the "ciDOF" arguments takes between 1-3 hours to run with my data. I tried to run the code with the "ciDOF" argument several times, but the calculations seem to take very long (I stopped the calculation after 8 hours). No error message is generated.
Does anyone have experience with this argument and could kindly tell me how long I should expect the calculation to take (the size of my data is 167*19) and if I have specified the "ciDOF" argument correctly?
Many thanks for the help!
If your data matrix U is of size 167 x 19, then what you are asking for is a copula-fit distribution dependent on 19-dimensions, making your copula a distribution in a 20-dimensional space with 19 dependent variables.
This is almost certainly why it is taking so long: whether or not it is your intention, you are asking MATLAB to solve a minimization problem that takes 19 marginal distributions and comes up with the 19-variate joint distribution (the copula) in which each marginal distribution (represented by a 167 x 1 column vector) is uniform.
Most likely this is a limitation of the MATLAB implementation, which iterates through many independent computations and then tries to combine them to fit the joint distribution's ideal conditions.
First and foremost -- and not to be insulting or insinuating -- you should definitely check that you really are trying to find a 19-variate copula. Also, just in case, make sure that your matrix U is oriented in the proper way, because if you have it transposed, you could be trying to ask for the solution to a 167-variate distribution.
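A quick sanity check along these lines might look like the sketch below. The matrix U here is random placeholder data standing in for the real one; copulafit expects one observation per row, one variable per column, and values strictly between 0 and 1.

```matlab
% Placeholder data: replace with your own 167-by-19 matrix.
U = rand(167, 19);

[nObs, nDim] = size(U);   % rows = observations, columns = variables
fprintf('%d observations of a %d-variate copula\n', nObs, nDim);

% Fewer rows than columns usually means the matrix is transposed.
looksTransposed = nObs < nDim;

% copulafit needs every entry inside the open interval (0, 1).
inUnitInterval = all(U(:) > 0 & U(:) < 1);
```

If looksTransposed is true, passing U.' instead is probably what was intended.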
But, if this is what you are actually trying to do, there is not really an easy way to predict how long it will take or how long it should take. Even with multiple dimensions, if your marginals are simple or uniform already, that would greatly reduce the copula computation. But, really, there is no way to tell.
Although this may seem like a cop-out, you may actually have better luck switching from MATLAB to R, especially if you have a lot of multivariate data, and you will probably find a lot more functionality in R than in MATLAB. R is freely available and comes with a Graphical User Interface (GUI), in case you aren't comfortable with command-line programming.
There are many more sources, but here is one PDF lecture on computing copula-fits in R:

What's the best way to calculate a numerical derivative in MATLAB?

(Note: This is intended to be a community Wiki.)
Suppose I have a set of points xi = {x0,x1,x2,...xn} and corresponding function values fi = f(xi) = {f0,f1,f2,...,fn}, where f(x) is, in general, an unknown function. (In some situations, we might know f(x) ahead of time, but we want to do this generally, since we often don't know f(x) in advance.) What's a good way to approximate the derivative of f(x) at each point xi? That is, how can I estimate values of dfi == d/dx fi == df(xi)/dx at each of the points xi?
Unfortunately, MATLAB doesn't have a very good general-purpose numerical differentiation routine. Part of the reason is probably that choosing a good routine can be difficult!
So what kinds of methods are there? What routines exist? How can we choose a good routine for a particular problem?
There are several considerations when choosing how to differentiate in MATLAB:
1. Do you have a symbolic function or a set of points?
2. Is your grid evenly or unevenly spaced?
3. Is your domain periodic? Can you assume periodic boundary conditions?
4. What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
5. Does it matter to you that your derivative is evaluated on the same points as your function is defined?
6. Do you need to calculate multiple orders of derivatives?
What's the best way to proceed?
These are just some quick-and-dirty suggestions. Hopefully somebody will find them helpful!
1. Do you have a symbolic function or a set of points?
If you have a symbolic function, you may be able to calculate the derivative analytically. (Chances are, you would have done this if it were that easy, and you would not be here looking for alternatives.)
If you have a symbolic function and cannot calculate the derivative analytically, you can always evaluate the function on a set of points, and use some other method listed on this page to evaluate the derivative.
In most cases, you have a set of points (xi,fi), and will have to use one of the following methods....
2. Is your grid evenly or unevenly spaced?
If your grid is evenly spaced, you probably will want to use a finite difference scheme (see either of the Wikipedia articles here or here), unless you are using periodic boundary conditions (see below). Here is a decent introduction to finite difference methods in the context of solving ordinary differential equations on a grid (see especially slides 9-14). These methods are generally computationally efficient, simple to implement, and the error of the method can be simply estimated as the truncation error of the Taylor expansions used to derive it.
If your grid is unevenly spaced, you can still use a finite difference scheme, but the expressions are more difficult and the accuracy varies very strongly with how uniform your grid is. If your grid is very non-uniform, you will probably need to use large stencil sizes (more neighboring points) to calculate the derivative at a given point. People often construct an interpolating polynomial (often the Lagrange polynomial) and differentiate that polynomial to compute the derivative. See for instance, this StackExchange question. It is often difficult to estimate the error using these methods (although some have attempted to do so: here and here). Fornberg's method is often very useful in these cases....
Care must be taken at the boundaries of your domain because the stencil often involves points that are outside the domain. Some people introduce "ghost points" or combine boundary conditions with derivatives of different orders to eliminate these "ghost points" and simplify the stencil. Another approach is to use right- or left-sided finite difference methods.
Here's an excellent "cheat sheet" of finite difference methods, including centered, right- and left-sided schemes of low orders. I keep a printout of this near my workstation because I find it so useful.
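As a minimal sketch of the ideas in this section, the following uses a second-order centered stencil in the interior of an evenly spaced grid and second-order one-sided stencils at the two boundaries. The test function sin(x) and the spacing h are arbitrary choices for illustration.

```matlab
h = 0.01;
x = 0:h:1;
f = sin(x);               % sample function; the exact derivative is cos(x)

df = zeros(size(f));

% Interior points: centered difference, O(h^2) truncation error.
df(2:end-1) = (f(3:end) - f(1:end-2)) / (2*h);

% Boundaries: one-sided three-point stencils, also O(h^2).
df(1)   = (-3*f(1) + 4*f(2) - f(3)) / (2*h);
df(end) = ( 3*f(end) - 4*f(end-1) + f(end-2)) / (2*h);

maxErr = max(abs(df - cos(x)));
fprintf('maximum error: %g\n', maxErr);
```

Using sided stencils of the same order at the ends keeps the whole estimate second-order without introducing ghost points.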
3. Is your domain periodic? Can you assume periodic boundary conditions?
If your domain is periodic, you can compute derivatives to a very high order accuracy using Fourier spectral methods. This technique sacrifices performance somewhat to gain high accuracy. In fact, if you are using N points, your estimate of the derivative is approximately N^th order accurate. For more information, see (for example) this WikiBook.
Fourier methods often use the Fast Fourier Transform (FFT) algorithm to achieve roughly O(N log(N)) performance, rather than the O(N^2) algorithm that a naively-implemented discrete Fourier transform (DFT) might employ.
If your function and domain are not periodic, you should not use the Fourier spectral method. If you attempt to use it with a function that is not periodic, you will get large errors and undesirable "ringing" phenomena.
Computing derivatives of any order requires 1) a transform from grid-space to spectral space (O(N log(N))), 2) multiplication of the Fourier coefficients by their spectral wavenumbers (O(N)), and 3) an inverse transform from spectral space to grid space (again O(N log(N))).
Care must be taken when multiplying the Fourier coefficients by their spectral wavenumbers. Every implementation of the FFT algorithm seems to have its own ordering of the spectral modes and normalization parameters. See, for instance, the answer to this question on the Math StackExchange, for notes about doing this in MATLAB.
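The three steps can be sketched as follows for a first derivative on a periodic grid. This assumes an even number of points N, a domain of length L that excludes the repeated end point, and MATLAB's fft mode ordering (non-negative wavenumbers first, then negative ones). The test function is an arbitrary smooth periodic choice.

```matlab
N = 32;                       % number of grid points (assumed even)
L = 2*pi;                     % length of the periodic domain
x = (0:N-1)' * (L/N);         % grid without the repeated end point
f = exp(sin(x));              % smooth periodic sample function

% Wavenumbers in fft order. The Nyquist mode is set to zero, a common
% convention for odd-order derivatives.
k = (2*pi/L) * [0:N/2-1, 0, -N/2+1:-1]';

fHat = fft(f);                        % 1) grid space -> spectral space
dfHat = 1i * k .* fHat;               % 2) multiply by i*k
dfdx = real(ifft(dfHat));             % 3) spectral space -> grid space

maxErr = max(abs(dfdx - cos(x) .* exp(sin(x))));
fprintf('maximum error: %g\n', maxErr);
```

For an n-th derivative the multiplier becomes (1i*k).^n, and for even n the Nyquist wavenumber is normally kept rather than zeroed.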
4. What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
For many purposes, a 1st or 2nd order finite difference scheme may be sufficient. For higher precision, you can use higher order Taylor expansions, dropping higher-order terms.
If you need to compute the derivatives within a given tolerance, you may want to look around for a high-order scheme that has the error you need.
Often, the best way to reduce error is reducing the grid spacing in a finite difference scheme, but this is not always possible.
Be aware that higher-order finite difference schemes almost always require larger stencil sizes (more neighboring points). This can cause issues at the boundaries. (See the discussion above about ghost points.)
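One way to check the order of a scheme empirically is to halve the grid spacing and watch how the error shrinks. The sketch below does this at the single point x = 1 for sin(x), both arbitrary choices: a first-order scheme should roughly halve its error, and a second-order scheme should roughly quarter it.

```matlab
f = @(x) sin(x);
x0 = 1;
dfExact = cos(x0);

hs = [1e-1, 5e-2, 2.5e-2];    % each spacing is half the previous one

errForward = abs((f(x0 + hs) - f(x0)) ./ hs - dfExact);            % 1st order
errCentral = abs((f(x0 + hs) - f(x0 - hs)) ./ (2*hs) - dfExact);   % 2nd order

% Error ratios between successive spacings: about 2 and about 4.
ratioForward = errForward(1:end-1) ./ errForward(2:end);
ratioCentral = errCentral(1:end-1) ./ errCentral(2:end);

disp(ratioForward);
disp(ratioCentral);
```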
5. Does it matter to you that your derivative is evaluated on the same points as your function is defined?
MATLAB provides the diff function to compute differences between adjacent array elements. This can be used to calculate approximate derivatives via a first-order forward-differencing (or forward finite difference) scheme, but the estimates are low-order estimates. As described in MATLAB's documentation of diff, if you input an array of length N, it will return an array of length N-1. When you estimate derivatives using this method on N points, you will only have estimates of the derivative at N-1 points. (Note that this can be used on uneven grids, if they are sorted in ascending order.)
In most cases, we want the derivative evaluated at all points, which means we want to use something besides the diff method.
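The difference in output length is easy to see in a short sketch. The grid and the function x.^2 are arbitrary; gradient is shown here as one built-in alternative that returns an estimate at every grid point.

```matlab
x = linspace(0, 1, 11);
f = x.^2;                          % exact derivative is 2*x

dfForward = diff(f) ./ diff(x);    % forward differences: N-1 values
dfGrad    = gradient(f, x);        % central in the interior, one-sided at the ends: N values

fprintf('diff gives %d values, gradient gives %d values\n', ...
        numel(dfForward), numel(dfGrad));

% For a quadratic, the central differences in the interior are exact.
interiorErr = max(abs(dfGrad(2:end-1) - 2*x(2:end-1)));
```

The forward differences are best thought of as living at the midpoints between grid points, which is why one value is lost.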
6. Do you need to calculate multiple orders of derivatives?
One can set up a system of equations in which the grid point function values and the 1st and 2nd order derivatives at these points all depend on each other. This can be found by combining Taylor expansions at neighboring points as usual, but keeping the derivative terms rather than cancelling them out, and linking them together with those of neighboring points. These equations can be solved via linear algebra to give not just the first derivative, but the second as well (or higher orders, if set up properly). I believe these are called combined finite difference schemes, and they are often used in conjunction with compact finite difference schemes, which will be discussed next.
Compact finite difference schemes: in these schemes, one sets up a design matrix and calculates the derivatives at all points simultaneously via a matrix solve. They are called "compact" because they are usually designed to require fewer stencil points than ordinary finite difference schemes of comparable accuracy. Because they involve a matrix equation that links all points together, certain compact finite difference schemes are said to have "spectral-like resolution" (e.g. Lele's 1992 paper--excellent!), meaning that they mimic spectral schemes by depending on all nodal values and, because of this, they maintain accuracy at all length scales. In contrast, typical finite difference methods are only locally accurate (the derivative at point #13, for example, ordinarily doesn't depend on the function value at point #200).
A current area of research is how best to solve for multiple derivatives in a compact stencil. The results of such research, combined, compact finite difference methods, are powerful and widely applicable, though many researchers tend to tune them for particular needs (performance, accuracy, stability, or a particular field of research such as fluid dynamics).
Ready-to-Go Routines
As described above, one can use the diff function to compute rough derivatives between adjacent array elements.
MATLAB's gradient routine is a great option for many purposes. It implements a second-order central difference scheme at interior points (with one-sided differences at the two end points). It has the advantages of computing derivatives in multiple dimensions and supporting arbitrary grid spacing. (Thanks to @thewaywewalk for pointing out this glaring omission!)
I used Fornberg's method (see above) to develop a small routine (nderiv_fornberg) to calculate finite differences in one dimension for arbitrary grid spacings. I find it easy to use. It uses sided stencils of 6 points at the boundaries and a centered, 5-point stencil in the interior. It is available at the MATLAB File Exchange here.
The field of numerical differentiation is very diverse. For each method listed above, there are many variants with their own set of advantages and disadvantages. This post is hardly a complete treatment of numerical differentiation.
Every application is different. Hopefully this post gives the interested reader an organized list of considerations and resources for choosing a method that suits their own needs.
This community wiki could be improved with code snippets and examples particular to MATLAB.
I believe there is more to these particular questions, so I have elaborated on the subject further as follows:
(4) Q: What level of accuracy are you looking for? Do you need to compute the derivatives within a given tolerance?
A: The accuracy required of numerical differentiation (ND) depends on the application of interest. If you are using ND in a forward problem, approximating derivatives in order to estimate features of a signal of interest, then you should be aware of noise perturbations. Such artifacts usually contain high-frequency components and, by the definition of the differentiator, the noise is amplified on the order of $|\omega|^n$ (the n-th derivative multiplies the spectrum by $(i\omega)^n$). So increasing the accuracy of the differentiator (increasing the polynomial accuracy) will not help at all. In this case you need to cancel the effect of the noise before differentiating. This can be done in cascade: first smooth the signal, and then differentiate. A better way of doing this is to use a "lowpass differentiator". A good example of a MATLAB library can be found here.
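The cascade approach (smooth first, then differentiate) can be sketched as below. This is only an illustration under several assumptions: the "noise" is a deterministic high-frequency sine so the result is reproducible, the smoother is a plain 21-point moving average rather than a designed lowpass differentiator, and the errors are measured away from the ends to avoid the edge effects of the zero-padded convolution.

```matlab
h = 0.01;
x = 0:h:6;
clean = sin(x);                          % underlying signal, derivative cos(x)
noisy = clean + 1e-2 * sin(200 * x);     % small high-frequency perturbation

% Differentiating directly amplifies the perturbation.
dDirect = gradient(noisy, h);

% Cascade: moving-average smoothing, then differentiation.
win = 21;
smoothed = conv(noisy, ones(1, win) / win, 'same');
dSmooth = gradient(smoothed, h);

interior = 31:numel(x)-30;               % skip samples affected by the edges
errDirect = max(abs(dDirect(interior) - cos(x(interior))));
errSmooth = max(abs(dSmooth(interior) - cos(x(interior))));

fprintf('error without smoothing: %g\n', errDirect);
fprintf('error with smoothing:    %g\n', errSmooth);
```

A perturbation of amplitude 0.01 is enough to swamp the derivative when differentiated directly, which is the amplification effect described above.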
However, if this is not the case and you're using ND in inverse problems, such as solving PDEs, then the global accuracy of the differentiator is very important. The design has to be adapted to whichever kind of boundary condition (BC) suits your problem. The rule of thumb is to increase the numerical accuracy by using what is known as a fullband differentiator. You need to design a derivative matrix that takes care of the suitable BC. You can find comprehensive solutions to such designs using the above link.
(5) Does it matter to you that your derivative is evaluated on the same points as your function is defined?
A: Yes, absolutely. Schemes that evaluate the ND on the same grid points are called "centralized", and schemes that evaluate it between the points are called "staggered". Note that for odd orders of derivative, a centralized ND degrades the accuracy of the differentiator's frequency response. Therefore, if you're using such a design in inverse problems, this will perturb your approximation. The opposite applies to even orders of differentiation with staggered schemes. You can find a comprehensive explanation of this subject using the link above.
(6) Do you need to calculate multiple orders of derivatives?
This totally depends on your application at hand. You can refer to the same link I have provided and take care of multiple derivative designs.

Tolerances in Numerical quadrature - MATLAB

What is the difference between AbsTol and RelTol in MATLAB when performing numerical quadrature?
I have a triple integral that is supposed to generate a number between 0 and 1, and I am wondering what would be the best tolerance for my application.
Are there any other ideas for decreasing the execution time of integral3?
Also, does anyone know whether integral3 or quadgk is faster?
When performing the integration, MATLAB (or most any other integration software) computes a low-order solution qLow and a high-order solution qHigh.
There are a number of different methods of computing the true error (i.e., how far either qLow or qHigh is from the actual solution qTrue), but MATLAB simply computes an absolute error as the difference between the high and low order integral solutions:
errAbs = abs(qLow - qHigh).
If the integral is truly a large value, that difference may be large in an absolute sense but not a relative sense. For example, errAbs might be 1E3, but qTrue is 1E12; in that case, the method could be said to converge relatively since at least 8 digits of accuracy has been reached.
So MATLAB also considers the relative error :
errRel = abs(qLow - qHigh)/abs(qHigh).
You'll notice I'm treating qHigh as qTrue since it is our best estimate.
Over a given sub-region, if the error estimate falls below either the absolute limit or the relative limit times the current integral estimate, the integral is considered converged. If not, the region is divided, and the calculation repeated.
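As a small sketch of that stopping test, here it is applied to the numbers from the example above. The tolerance values are the documented defaults of integral; the variable names are illustrative and the real implementation applies the test per sub-region.

```matlab
absTol = 1e-10;               % default 'AbsTol' of integral
relTol = 1e-6;                % default 'RelTol' of integral

qHigh = 1e12;                 % high-order estimate, used as the best guess of qTrue
qLow  = qHigh - 1e3;          % low-order estimate

errAbs = abs(qLow - qHigh);
errRel = errAbs / abs(qHigh);

% Converged if the error is below the absolute limit OR below the
% relative limit times the current estimate.
converged = errAbs <= max(absTol, relTol * abs(qHigh));

fprintf('errAbs = %g, errRel = %g, converged = %d\n', errAbs, errRel, converged);
```

Here the absolute test alone would fail, but the relative test passes. For a result between 0 and 1, the two limits are of comparable size, so whichever tolerance is looser tends to decide.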
For the integral function and integral2/integral3 functions with the iterated method, the low-high solutions are a Gauss-Kronrod 7-15 pair (the same 7-point/15-point set used by quadgk).
For the integral2/integral3 functions with the tiled method, the low-high solutions are a Gauss-Kronrod 3-7 pair (I've never used this option, so I'm not sure how it compares to others).
Since all of these methods come down to a Gauss-Kronrod quadrature rule, I'd say sticking with integral3 and letting it do the adaptive refinement as needed is the best course.
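For reference, tolerances and the method are passed to integral3 as name-value pairs. The integrand below is a made-up example whose exact value over the unit cube is 1/8, and the tolerance values are arbitrary. Loosening them is one of the few knobs available for reducing run time.

```matlab
% Hypothetical integrand; it must accept arrays, hence the element-wise operators.
fun = @(x, y, z) x .* y .* z;

% Default tolerances.
qDefault = integral3(fun, 0, 1, 0, 1, 0, 1);

% Looser tolerances, which typically need fewer function evaluations.
qLoose = integral3(fun, 0, 1, 0, 1, 0, 1, 'AbsTol', 1e-6, 'RelTol', 1e-4);

% Explicitly selecting the iterated method.
qIter = integral3(fun, 0, 1, 0, 1, 0, 1, 'Method', 'iterated');

fprintf('%.10f  %.10f  %.10f\n', qDefault, qLoose, qIter);
```

Wrapping each call in tic and toc on the actual integrand is the most reliable way to see which combination is fastest for a given problem.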

Solving a non-polynomial equation numerically

I've got a problem with an equation that I'm trying to solve numerically using both MATLAB and the Symbolic Toolbox. I've gone through several pages of MATLAB help, picked up a few tricks and tried most of them, still without a satisfying result.
My goal is to solve a set of three non-polynomial equations in the angles q1, q2 and q3. Those variables represent joint angles in my industrial manipulator, and what I'm trying to achieve is to solve the inverse kinematics of this model. My set of equations looks like this:
I'm solving it with
numeric::solve([z1,z2,z3], [q1=x1..x2,q2=x3..x4,q3=x5..x6], MultiSolutions)
changing the xn constants according to my needs. Yet I still get some odd results: q1 is off by approximately 0.1 rad, and q2 and q3 are off by ~0.01 rad. I don't have much experience with numeric solve, so I just need to know: is it supposed to look like that?
And, if not, what valid option do you suggest I should take next? Maybe transforming this equation to polynomial, maybe using a different toolbox?
Or, if trying to do this in Matlab, how can you limit your solutions when using solve()? I'm thinking of an equivalent to Symbolic Toolbox's assume() and assumeAlso.
I would be grateful for your help.
The numerical solution of a system of nonlinear equations is generally treated as an iterative minimization process: finding the global minimum of the norm of the difference between the left- and right-hand sides of the equations. For example, fsolve essentially uses Newton iterations. Those methods perform a "deterministic" optimization: they start from an initial guess and then move through the space of unknowns, essentially along the direction opposite to the gradient, until a solution is found.
You then have two kinds of issues:
Local minima: the stopping rule of the iteration is related to the gradient of the functional. When the gradient becomes small, the iterations are stopped. But the gradient can also become small at local minima, not only at the desired global one. When the initial guess is far from the actual solution, you can get stuck in a false solution.
Ill-conditioning: large variations of the unknowns can be reflected into small variations of the data. So, small numerical errors in the data (for example, machine rounding) can lead to large variations of the unknowns.
Due to the above problems, the solution found by your numerical algorithm is likely to differ (even significantly) from the actual one.
I recommend that you run a consistency test by choosing a starting guess, for example when using fsolve, very close to the actual solution and verifying that your final result is accurate. You will then discover that, as you move the initial guess farther away from the actual solution, your result is likely to show some (even large) errors. Of course, the size of the errors depends on the nature of the system of equations. In some lucky cases, those errors may remain very small.
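The consistency test can be sketched without any toolbox by using a hand-written Newton iteration in place of fsolve. Since the original equations are not shown, the system below is a stand-in: the forward kinematics of a hypothetical planar two-link arm with unit link lengths, with the target position generated from a known joint configuration qTrue.

```matlab
% Hypothetical planar two-link arm with unit link lengths.
fk = @(q) [cos(q(1)) + cos(q(1) + q(2)); sin(q(1)) + sin(q(1) + q(2))];
qTrue  = [0.3; 0.8];                  % known configuration used to build the target
target = fk(qTrue);

F = @(q) fk(q) - target;              % residual to drive to zero
J = @(q) [-sin(q(1)) - sin(q(1) + q(2)), -sin(q(1) + q(2)); ...
           cos(q(1)) + cos(q(1) + q(2)),  cos(q(1) + q(2))];

nearGuess = qTrue + 0.05;             % starting guess close to the known solution
farGuess  = [1.3; -1.0];              % starting guess farther away
guesses   = [nearGuess, farGuess];

solutions = zeros(2, 2);
for g = 1:2
    q = guesses(:, g);
    for iter = 1:50
        step = J(q) \ F(q);           % Newton step
        q = q - step;
        if norm(step) < 1e-12, break; end
    end
    solutions(:, g) = q;
end

residuals = [norm(F(solutions(:, 1))), norm(F(solutions(:, 2)))];
disp(solutions);
disp(residuals);
```

In this sketch both runs reach a small residual, but the second one settles on a different joint configuration that reaches the same target. A solver can therefore return a perfectly valid root that is still not the one you expected, which is why the starting guess and the search ranges matter.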

Why doesn't k-means give the global minima?

I read that the k-means algorithm only converges to a local minimum and not to a global minimum. Why is this? I can see logically how initialization could affect the final clustering and that a sub-optimal clustering is possible, but I did not find anything that proves it mathematically.
Also, why is k-means an iterative process?
Can't we just partially differentiate the objective function w.r.t. the centroids and equate it to zero to find the centroids that minimize this function? Why do we have to use gradient descent to reach the minimum step by step?
. c .
. c .
Where c is a cluster centroid. The algorithm will stop, but a better solution is:
. .
c c
. .
With regards to a proof - You don't require a mathematical proof to prove that something isn't always true, you just need a single counter-example, as provided above. You can probably convert the above into a mathematical proof, but this is unnecessary and generally requires a lot of work; even in academia it is accepted to merely give a counter-example to disprove something.
The k-means algorithm is by definition an iterative process; that is simply the way it works. The problem of clustering is NP-hard, so using an exact algorithm to calculate the centroids would take immensely long.
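The counter-example can be checked numerically with a few lines of hand-written Lloyd iterations. This sketch assumes the four points sit at the corners of a rectangle that is wider than it is tall; the coordinates are made up, and the two initializations correspond to the two centroid placements drawn above.

```matlab
X = [0 0; 4 0; 0 1; 4 1];                    % corners of a 4-by-1 rectangle

% Two initializations: centroids on the top/bottom edges, and on the left/right edges.
inits = {[2 0; 2 1], [0 0.5; 4 0.5]};
costs = zeros(1, numel(inits));

for t = 1:numel(inits)
    C = inits{t};
    for iter = 1:100
        % Assignment step: squared distance from every point to every centroid.
        D = zeros(size(X, 1), size(C, 1));
        for j = 1:size(C, 1)
            D(:, j) = sum((X - repmat(C(j, :), size(X, 1), 1)).^2, 2);
        end
        [~, labels] = min(D, [], 2);

        % Update step: move each centroid to the mean of its points.
        Cnew = C;
        for j = 1:size(C, 1)
            if any(labels == j)
                Cnew(j, :) = mean(X(labels == j, :), 1);
            end
        end
        if isequal(Cnew, C), break; end      % no change: converged
        C = Cnew;
    end
    costs(t) = sum(min(D, [], 2));           % within-cluster sum of squares
end

disp(costs);
```

Both runs stop immediately because neither initialization changes under the assignment and update steps, yet their objective values differ. The first is therefore a fixed point of the algorithm that is not the global minimum.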
Don't mix the problem and the algorithm.
The k-means problem is finding the least-squares assignment to centroids.
There are multiple algorithms for finding a solution.
There is an obvious approach to find the global optimum: enumerating all k^n possible assignments - that will yield a global minimum, but in exponential runtime.
Much more attention was put to finding an approximate solution in faster time.
The Lloyd/Forgy algorithm is an EM-style iterative model refinement approach, that is guaranteed to converge to a local minimum simply because there is a finite number of states, and the objective function must decrease in every step. This algorithm runs in O(n*k*i) where i << n usually, but it may find a local minimum only.
MacQueen's method is technically not iterative. It's a single-pass, one-element-at-a-time algorithm that will not even find a local minimum in the Lloyd sense. (You can however run it for multiple passes over the data set, until convergence, to get a local minimum too!) If you do a single pass, it's in O(n*k); for multiple passes add a factor i. It may or may not take more passes than Lloyd.
Then there is Hartigan and Wong. I don't remember the details, IIRC it was a clever, more lazy, variant of Lloyd/Forgy, so probably in O(n*k*i), too (although probably not recomputing all n*k distances for later iterations?)
You could also use a randomized algorithm that just tests l random assignments. It probably won't find a minimum at all, but it runs in "linear" time O(n*l).
Oh, and you can try different random initializations, to improve your chances of finding the global minimum. Add a factor t for the number of trials...