I'm currently using Matlab's k nearest neighbors classifier (knnclassify) to train and test binary attributes. The default value argument for k if none provided is 1 and one can choose other values of k. I've done research online and on stackoverflow but nothing relevant came up to address my question for what value of k would be of best use. Is there a built in function that can tell me that for my particular data or is it simply guess and wait to see what accuracy is derived? Any help will be greatly appreciated.
Here is the link to matlab's knnclassify documentation: knnclassify
What you have here is a typical model selection problem. What you want is to pick the k that gives you the lowest overall error on your data. Larger values of k generalize better, and smaller values may tend to overfit.
Hence, cross-validation is a good way to choose this parameter and I found the this article, which seems like a reasonable method.
Related
I'm building a Neural Network from scratch that categorizes values of x into 21 possible estimates for sin(x). I plan to use MSE as my loss function.
MSE of each minibatch = ||y(x) - a||^2, where y(x) is the vector of network outputs for x-
values in the minibatch. a is the vector of expected outputs that correspond to each x.
After finding the loss, the column vector of all weights in the network is recalculated. Column vector of delta w's ~= column vector of partial derivatives of C with respect to each weight.
∇C≡(∂C/∂w1,∂C/∂w2...).T and Δw =−η∇C where η is the (positive) learn rate.
The problem is, to find the gradient of C, you have to differentiate with respect to each weight. What does that function even look like? It's not just the previously stated MSE right?
Any help is appreciated. Also, apologies in advance if this question is misplaced, I wasn't sure if it belonged here or in a math forum.
Thank you.
(I might add that I have tried to find an answer to this online, but few examples exist that either don't use libraries to do the dirty work or present the information clearly.)
http://neuralnetworksanddeeplearning.com/chap2.html
I had found this a while ago but only now realized it's significance. The link describes δ(j,l) as an intermediary value to arrive at the partial derivative of C with respect to weights. I will post back here with a full answer if the link above answers my question, as I've seen a few posts similar to mine that have yet to be answered.
Does anyone know how to tell the difference between distributions (ie their goodness of fit) using the dfittool in Matlab? In a class I took forever ago, we learned about the log likelihood parameter and how to compare a pdf fitted to Gaussian vs gamma, etc. But right now, all the matlab help files online are like "it means something." Any assistance would be appreciated. Basically, I need to interpret the "results" in "edit fit" of the dfittool. I want to be able to compare my dfits to each other from the results, so I can pick the best fit for my analysis. I don't know what the difference is between a log likelihood of -111 vs -105.
Example below:
Distribution: Normal
Log likelihood: -110.954
Domain: -Inf < y < Inf
Mean: 101.443
Variance: 436.332
Parameter Estimate Std. Err.
mu 101.443 4.17771
sigma 20.8886 3.04691
Estimated covariance of parameter estimates:
mu sigma
mu 17.4533 6.59643e-15
sigma 6.59643e-15 9.28366
Thank you!
(Log) likelihood is a measure of the fit of a distribution to data, so the simple answer is: the distribution with the largest likelihood is the one that fits best. However, what you get here as an output is the maximized likelihood, i.e. the likelihood at those parameter values where it is maximal. Different families of distributions might be differently "flexible", so that it is easier to get a larger likelihood with one of them in general, so this limits comparability. This holds especially if you compare families with different numbers of parameters. A fix for this is to use formal model comparison, e.g. using the Bayes factor, which however is considerably more complex mathematically, or its approximation, the Bayesian information criterion.
More generally speaking however, it is seldomly a good idea to just randomly pick distributions and see how well they fit. It would be better to have some at least partially theoretically motivated idea why a distribution is a candidate. On the most basic level this means considering its definition range: the normal distribution is defined on the whole real line, the gamma distribution only for nonnegative real numbers. This way it should be possible to rule one of them out based on basic properties of your data.
I am working on a UCI data set about wine quality. I have applied multiple classifiers and k-nearest neighbor is one of them. I was wondering if there is a way to find the exact value of k for nearest neighbor using 5-fold cross validation. And if yes, how do I apply that? And how can I get the depth of a decision tree using 5-fold CV?
Thanks!
I assume here that you mean the value of k that returns the lowest error in your wine quality model.
I find that a good k can depend on your data. Sparse data might prefer a lower k whereas larger datasets might work well with a larger k. In most of my work, a k between 5 and 10 have been quite good for problems with a large number of cases.
Trial and Error can at times be the best tool here, but it shouldn't take too long to see a trend in the modelling error.
Hope this Helps!
I sometimes get -Inf or NaN as the final value of my target function when I am using matlab ga toolbox doing the minimization. But if I do the optimization again with exactly the same option set up, I get a finite answer... Could anyone tell me why this is the case? and how could I solve the problem? Thanks very much!
The documentation and examples for ga are bad about this and barely mention the stochastic nature of this method (though if you're using it maybe you would be aware). If you wish to have repeatable results, you should always specify a seed value when perform stochastic simulations. This can be done in at least two ways. You can use the rng function:
rng(0);
where 0 is the seed value. Or you can possibly use the 'rngstate' field if you specify the optimization as a problem structure. See more here on reproducing results.
If you're doing any sort of experiments you should be specifying a seed. That way you can repeat a run if necessary to check why something may have happened or to obtain more finely-grained data. Just change the seed value to another positive integer if you want to run again.
The Genetic Algorithm is a stochastic algorithm, which means it does not explore the same problem space every time you run it. On each run it will be trying different solutions, and occasionally it is running into a solution on which your target function is ill-behaved.
Without knowing more about your specific problem, all I can really suggest is that you take a closer look at your target function and see if you can restrict it so that it does not explode to negative infinity. Look at the solution returned by the GA when you get these crazy target values, and see if you can adjust your target function so that it does not return infinite values for such solutions.
I have a 40X3249 noisy dataset and 40X1 resultset. I want to perform simple sequential feature selection on it, in Matlab. Matlab example is complicated and I can't follow it. Even a few examples on SoF didn't help. I want to use decision tree as classifier to perform feature selection. Can someone please explain in simple terms.
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by #user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, scores, ~] = pca(X); % Find the basis functions and their weighting, X is row vectors
indices = find(scores > eps(2*max(scores))); % This is to find irrelevant components up to machine precision of the biggest component .. with a litte extra tolerance (2x)
new_basis = basis(:, indices); % This gets us the relevant components, which are stored in variable "basis" as column vectors
X_new = X*new_basis; % inner products between the new basis functions spanning some subspace of the original, and the original feature vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.