Relation between support vectors and accuracy in the case of an RBF kernel - MATLAB

I am using the RBF kernel with MATLAB's SVM function.
On a couple of datasets, as I increase the sigma value, the number of support vectors increases and the accuracy increases.
On one dataset, however, as I increase sigma, the number of support vectors decreases while the accuracy still increases.
I am unable to work out the relationship between the number of support vectors and accuracy for the RBF kernel.

The number of support vectors doesn't have a direct relationship to accuracy; it depends on the shape of the data (and your C/nu parameter).
Higher sigma means that the kernel is a "flatter" Gaussian and so the decision boundary is "smoother"; lower sigma makes it a "sharper" peak, and so the decision boundary is more flexible and able to reproduce strange shapes if they're the right answer. If sigma is very high, your data points will have a very wide influence; if very low, they will have a very small influence.
Thus, often, increasing the sigma values will result in more support vectors: for more-or-less the same decision boundary, more points will fall within the margin, because points become "fuzzier." Increased sigma also means, though, that the slack variables "moving" points past the margin are more expensive, and so the classifier might end up with a much smaller margin and fewer SVs. Of course, it also might just give you a dramatically different decision boundary with a completely different number of SVs.
In terms of maximizing accuracy, you should be doing a grid search over many different values of C and sigma and choosing the pair that gives you the best performance on, e.g., 3-fold cross-validation on your training set. One reasonable approach is to choose C from e.g. 2.^(-9:3:18) and sigma from median_eval * 2.^(-4:2:10); those numbers are fairly arbitrary, but they're ones I've used with success in the past.
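If you're working with fitcsvm from the Statistics and Machine Learning Toolbox, such a grid search might look roughly like the sketch below; X, Y and the exact ranges are placeholders, KernelScale plays the role of sigma here, and median_eval is taken to be the median pairwise distance between training points (one common reading of that quantity, not necessarily the only one).
% Rough grid-search sketch (assumes the Statistics and Machine Learning Toolbox).
% X: n-by-d feature matrix, Y: n-by-1 class labels (both placeholders).
median_eval = median(pdist(X));               % median pairwise distance heuristic
Cs     = 2.^(-9:3:18);
sigmas = median_eval * 2.^(-4:2:10);
bestErr = inf;
for C = Cs
    for sigma = sigmas
        mdl = fitcsvm(X, Y, 'KernelFunction', 'rbf', ...
                      'BoxConstraint', C, 'KernelScale', sigma);
        err = kfoldLoss(crossval(mdl, 'KFold', 3));   % 3-fold cross-validation error
        if err < bestErr
            bestErr = err;  bestC = C;  bestSigma = sigma;
        end
    end
end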

Related

Matlab: dealing with the performance cost of denormal conversion when close to realmin in backprop

I understand that if a number gets closer to zero than realmin, then Matlab converts the double to a denormal. I am noticing that this causes a significant performance cost. In particular, I am using a gradient descent algorithm in which, near convergence, the gradients (in backprop for my bespoke neural network) drop below realmin, so that the algorithm incurs a heavy performance cost (due to, I am assuming, type conversion behind the scenes). I have used the following code to validate my gradient matrices so that no number falls below realmin:
function mat = validateSmallDoubles(obj, mat, threshold)
    % Zero out any entries whose magnitude falls below the threshold.
    mat = mat.*(abs(mat) > threshold);
end
Is this usual practice, and what value should threshold take? (Obviously you want it as close to realmin as possible, but not too close, otherwise any additional division operations will send some elements of mat below realmin after validation.) Also, specifically for neural networks, where are the best places to do gradient validation without ruining the network's ability to learn? I would be grateful to know what solutions people with experience in training neural networks have found; I am sure this is a problem in all languages. Tentative threshold values have already ruined my network's learning.
I do not know whether it is related to your problem, but I had a similar issue with underflows while computing an exponentially weighted average of gradients (for example while implementing Momentum or Adam).
In particular, at some point you do something like:
v := 0.9*v + 0.1*gradient, where v is the exponentially weighted average of your gradient g. If the same element of your g matrix remains 0 over many successive iterations, v quickly becomes very small and you hit denormals.
So the question is: why all those zeros? In my case the culprit was the ReLU units, which output a lot of zeros (if x < 0, relu(x) is zero). When ReLU outputs zero for a given neuron, the related weight has no effect, which means the corresponding partial derivative in g is zero. So it happened that, over many successive iterations, that particular neuron was never fired.
To avoid having zero activations (and zero derivatives), I used a "leaky ReLU", so as to have a very small derivative instead.
Another solution is to use gradient clipping before applying the weighted average, thresholding the gradients at a minimum value, which is quite similar to what you did.
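For illustration only (this is generic, not the asker's network), a leaky ReLU and its derivative can be written as two anonymous functions in MATLAB; alpha, the leak slope, is an arbitrary choice here.
alpha = 0.01;                                        % leak slope (illustrative value)
leakyRelu     = @(x) max(x, 0) + alpha * min(x, 0);  % alpha*x instead of 0 for x < 0
leakyReluGrad = @(x) (x > 0) + alpha * (x <= 0);     % derivative is alpha, not 0, for x <= 0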
I traced the diminishing-gradient occurrences to the Adam SGD optimiser: the biased moving-average matrix calculations in the Adam optimiser were causing MATLAB to carry out the denormal operations. I simply thresholded the matrix elements of each layer to zero after these calculations, with threshold = 10*realmin, without any effect on learning. I have yet to investigate why my moving averages were getting so close to zero, as my architecture and weight-initialisation priors would normally mitigate this.
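Concretely, the thresholding step looks something like the sketch below, where m and v stand in for one layer's biased first- and second-moment matrices (the variable names are illustrative, not taken from my actual code):
threshold = 10 * realmin;         % anything much smaller risks becoming denormal
m = m .* (abs(m) > threshold);    % flush tiny first-moment entries to exactly zero
v = v .* (abs(v) > threshold);    % flush tiny second-moment entries to exactly zero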

Training data range for Neural Network

Is it better for a neural network to use a smaller range of training data, or does it not matter? For example, if I want to train an ANN with angles (float values), should I pass those values in degrees [0, 360], in radians [0, 6.28], or should all values be normalized to the range [0, 1]? Does the range of the training data affect ANN learning quality?
My neural network has 6 input neurons and 1 hidden layer, and I am using the symmetric sigmoid activation function (tanh).
For the neural network it doesn't matter whether the data is normalised.
However, the performance of the training method can vary a lot.
In a nutshell: the training methods typically favour variables with larger values, and this can send the training off-track.
Crucial for most NN training methods is that all dimensions of the training data have the same domain. If all your variables are angles, it doesn't matter whether they lie in [0,1), [0,2*pi) or [0,360), as long as they share the same domain. However, you should avoid having one variable for the angle in [0,2*pi) and another variable for a distance in mm, where the distance can be much larger than 2000000 mm.
Two cases where an algorithm might suffer:
(a) regularisation: if the weights of the NN are forced to be small, a tiny change in a weight controlling the input of a large-domain variable has a much larger impact than one controlling a small-domain variable;
(b) gradient descent: if the step size is limited, you get similar effects.
Recommendation: all variables should have the same domain size; whether it is [0,1] or [0,2*pi] doesn't matter.
Addition: for many domains, "z-score normalisation" works extremely well.
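A minimal sketch of z-score normalisation in MATLAB, assuming the training samples are the rows of a matrix Xtrain (the variable names are placeholders; the same training mean and standard deviation should be reused for any test data):
% Z-score normalisation: each column (variable) gets zero mean and unit variance.
mu = mean(Xtrain, 1);
sd = std(Xtrain, 0, 1);
sd(sd == 0) = 1;                     % guard against constant columns
Xtrain_n = (Xtrain - mu) ./ sd;      % implicit expansion (R2016b or newer)
Xtest_n  = (Xtest  - mu) ./ sd;      % reuse the training statistics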
The range of the data points affects the way you train a model. Suppose the range of values of the features in the data set is not normalized. Then, depending on your data, you may end up with elongated ellipses of data points in the feature space, and the learning model will have a very hard time learning the manifold on which the data points lie (i.e. the underlying distribution). Also, in most cases the data points are sparsely spread in the feature space if not normalized (see this). So the take-home message is to normalize the features when possible.

Why does classifier accuracy drop after PCA, even though 99% of the total variance is covered?

I have a 500x1000 feature matrix, and principal component analysis says that over 99% of the total variance is covered by the first component. So I replace each 1000-dimensional point with a 1-dimensional point, giving a 500x1 feature vector (using Matlab's pca function). But my classifier accuracy, which was initially around 80% with 1000 features, now drops to 30% with 1 feature, even though more than 99% of the variance is accounted for by that feature. What could be the explanation for this, or are my methods wrong?
(This question partly arises from my earlier question Significance of 99% of variance covered by the first component in PCA)
Edit:
I used Weka's principal components method to perform the dimensionality reduction and a support vector machine (SVM) classifier.
Principal Components do not necessarily have any correlation to classification accuracy. There could be a 2-variable situation where 99% of the variance corresponds to the first PC but that PC has no relation to the underlying classes in the data. Whereas the second PC (which only contributes to 1% of the variance) is the one that can separate the classes. If you only keep the first PC, then you lose the feature that actually provides the ability to classify the data.
In practice, smaller (lower variance) PCs often are associated with noise so there can be benefit in removing them but there is no guarantee of this.
Consider a case where you have two variables: a person's mass (in grams) and body temperature (in degrees Celsius). You want to predict which people have the flu and which do not. In this case, mass has a much greater variance but probably no correlation with the flu, whereas temperature, which has low variance, is strongly correlated with the flu. After the principal components transformation, the first PC will be strongly aligned with mass (since it has much greater variance), so if you dropped the second PC you would be losing almost all of your classification accuracy.
It is important to remember that Principal Components is an unsupervised transformation of the data. It does not consider labels of your training data when calculating the transformation (as opposed to something like Fisher's linear discriminant).
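A small synthetic sketch of the mass/temperature example above (the numbers are entirely made up, just to demonstrate the effect):
% Variable 1: huge variance, no class information. Variable 2: tiny variance, real signal.
rng(0);
n  = 200;
y  = [zeros(n,1); ones(n,1)];                          % healthy vs. flu labels
x1 = 70000 + 15000*randn(2*n,1);                       % "mass in grams"
x2 = [36.8 + 0.3*randn(n,1); 38.5 + 0.3*randn(n,1)];   % "body temperature in Celsius"
[coeff, score] = pca([x1 x2]);    % PC1 aligns almost entirely with x1
firstPC = score(:,1);             % keeping only PC1 discards nearly all class information
% A classifier trained on firstPC alone will do little better than chance, while one
% trained on x2 (roughly 1% of the variance) separates the two classes well.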

Obtaining poles and zeros from frequency response magnitude and Phase plot

Is there any MATLAB code available to determine the poles and zeros directly from a frequency response plot? Any reference will be greatly appreciated.
Obtaining poles and zeros from experimental data is a nontrivial task. The first step is to visually inspect the estimated FRF. Plot magnitude, phase and coherence together in one diagram. Some questions to answer:
Does the coherence drop below 0.9 in some areas? If so, don't trust the data in those regions. If you can, improve the experiment. Narrow areas of bad coherence at resonances may be unavoidable, and if there is a zero in the FRF, the coherence at that frequency will also be bad. In all other regions, however, a drop in coherence should not be accepted. A carefully designed experiment should yield gamma^2 > 0.95 everywhere except at resonances and zeros; failure to meet this criterion may indicate nonlinearity or high noise.
Can you see zeros or poles in the data? Look for peaks and valleys in the magnitude or jumps in the phase. If you see anything, write it down and compare it to your final results.
Is there any special structure in the FRF? Are the poles close to each other? Is it an oscillating system? Try to get some insight.
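If the raw input/output time records are still available, the estimated FRF and coherence for this inspection can be obtained with tfestimate and mscohere from the Signal Processing Toolbox; a rough sketch, in which x (input), y (output), fs (sample rate) and the segment length are placeholders:
nfft = 4096;                                                % segment length (arbitrary)
[Txy, f] = tfestimate(x, y, hann(nfft), nfft/2, nfft, fs);  % estimated FRF
Cxy      = mscohere(x, y, hann(nfft), nfft/2, nfft, fs);    % coherence gamma^2
subplot(3,1,1); semilogy(f, abs(Txy));        ylabel('|FRF|');
subplot(3,1,2); plot(f, unwrap(angle(Txy)));  ylabel('phase / rad');
subplot(3,1,3); plot(f, Cxy); ylim([0 1]);    ylabel('\gamma^2'); xlabel('f / Hz');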
If the structure of the FRF is simple, just do a least-squares fit. Transform the FRF to a complex representation (FRF = real + i*imag). Then assume that FRF(s) = num(s)/den(s), where the numerator and denominator are polynomials of chosen order. This gives FRF(s)*den(s) = num(s), which is linear in the coefficients of num(s) and den(s). Calculate the least-squares solution. Do this for different (independent) orders of num and den, then compute the zeros and poles from the resulting polynomials.
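For the least-squares step itself, the Signal Processing Toolbox already provides invfreqs (continuous-time) and invfreqz (discrete-time), which fit a rational transfer function to complex frequency-response data; a sketch, assuming the complex FRF samples are in H at angular frequencies w (both placeholders):
nb = 2;  na = 4;                       % numerator/denominator orders; try several choices
[num, den] = invfreqs(H, w, nb, na);   % least-squares fit of H(jw) by num(jw)/den(jw)
z = roots(num);                        % estimated zeros
p = roots(den);                        % estimated poles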
If this approach does not succeed, you need to understand what the problem is; software does not help with this important step. Of course, once you have some understanding, a more sophisticated approach may yield good results.

Histogram computational efficiency

I am trying to plot a 2 GB matrix using MATLAB's hist function on a computer with 4 GB of RAM. The operation is taking hours. Are there ways to increase the performance of the computation, e.g. by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, or deleting the raw data as it is added to bins?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to bin the data once using very small bins and then manipulate the bin sizes of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bins can only grow, not shrink)?
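For concreteness, the kind of "bin finely once, then merge bins" scheme I have in mind might look like the sketch below; histcounts, the fine bin count of 10,000, the merge factor of 10, and the variable name data for the raw values are all arbitrary, illustrative choices.
edgesFine  = linspace(min(data), max(data), 10001);   % 10,000 narrow bins over the data range
countsFine = histcounts(data, edgesFine);             % touch the raw data only once
k = 10;                                               % merge every k adjacent bins
countsCoarse = sum(reshape(countsFine, k, []), 1);    % 10,000 is divisible by k
edgesCoarse  = edgesFine(1:k:end);
histogram('BinEdges', edgesCoarse, 'BinCounts', countsCoarse);   % plot the coarsened histogram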
I don't like answers to Stack Overflow questions of the form "well, even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y."
But that's what I am going to do here. I think such an answer is justified in this rare instance because the answer below is in accord with sound practice in statistical analysis, and because it avoids the problem currently in front of you, which is crunching gigabytes of data.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs over histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, one that I think is particularly good is An introduction to kernel density estimation.
Reasons why a KDE is preferred to a histogram:
1. The shape of a histogram is strongly influenced by the choice of the total number of bins, yet there is no authoritative technique for calculating or even estimating a suitable value. (Any doubts about this? Just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
2. The shape of the histogram is strongly influenced by the choice of the locations of the bin edges.
3. A histogram gives a density estimate that is not smooth.
A KDE completely eliminates histogram properties 2 and 3. Although a KDE doesn't produce a density estimate with discrete bins, an analogous parameter, the "bandwidth", must still be supplied.
To calculate and plot a KDE, you need to pass in two parameters along with your data:
kernel function: the most common options are uniform, triangular, biweight, triweight, Epanechnikov, and normal (Gaussian); several of these are available in MATLAB, e.g. via ksdensity. Among them, the Gaussian kernel is probably the most often used.
bandwidth: the choice of bandwidth will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computing platforms like MATLAB, R, etc. include utility functions to estimate the bandwidth for you (e.g., rules based on minimising the mean integrated squared error, MISE).
KDE in MATLAB
kde.m is a function that implements KDE in MATLAB:
[h, fhat, xgrid] = kde(x, 401);
Notice that the bandwidth and kernel are not supplied in the call to kde.m: for the bandwidth, kde.m wraps a function that performs automatic bandwidth selection; for the kernel function, a Gaussian is used.
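If you'd rather use a toolbox function, ksdensity in the Statistics and Machine Learning Toolbox does essentially the same job (a Gaussian kernel and automatic bandwidth by default, both overridable); for a very large vector, evaluating it on a random subsample is one pragmatic compromise, and the subsample size below is arbitrary.
idx = randperm(numel(x), min(numel(x), 1e6));   % random subsample of at most 1e6 points
[f, xi] = ksdensity(x(idx));                    % Gaussian kernel, automatic bandwidth
plot(xi, f);                                    % a single smooth curve instead of bars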
But will using a KDE in place of a histogram solve, or substantially reduce, the very slow performance on your 2 GB dataset?
It certainly should.
In your question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping thousands (millions?) of data points to a symbol, color, and specific location on a canvas; instead it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, the individual points don't need to be held in memory while the plot is created and rendered.