Choose the right classification algorithm. Linear or non-linear? [closed] - classification

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I find this question a little tricky. Maybe someone knows an approach to answering it. Imagine that you have a dataset (training data) and you don't know what it is about. Which features of the training data would you look at in order to choose a classification algorithm for it? Can we say anything about whether we should use a non-linear or a linear classification algorithm?
By the way, I am using WEKA to analyze the data.
Any suggestions?
Thank you.

This is in fact two questions in one ;-)
Feature selection
Linear or not
Add "algorithm selection", and you probably have the three most fundamental questions of classifier design.
As an aside, it's a good thing that you do not have any domain expertise which would have allowed you to guide the selection of features and/or to assert the linearity of the feature space. That's the fun of data mining: to infer such info without a priori expertise. (BTW, while domain expertise is good for double-checking the outcome of the classifier, too much a priori insight may make you miss good mining opportunities.) Without any such a priori knowledge you are forced to establish sound methodologies and apply careful scrutiny to the results.
It's hard to provide specific guidance, in part because many details are left out of the question, and also because I'm somewhat BS-ing my way through this ;-). Nevertheless, I hope the following generic advice will be helpful.
For each algorithm you try (or more precisely for each set of parameters for a given algorithm), you will need to run many tests. Theory can be very helpful, but there will remain a lot of "trial and error". You'll find Cross-Validation a valuable technique.
In a nutshell [and depending on the size of the available training data], you randomly split the training data into several parts, train the classifier on one [or several] of these parts, and then evaluate its performance on another [or several] of the remaining parts. For each such run you measure various indicators of performance, such as the Mis-Classification Error (MCE). Aside from telling you how the classifier performs, these metrics, or rather their variability, will provide hints as to the relevance of the features selected and/or their lack of scale or linearity.
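The split/train/evaluate loop described above can be sketched roughly as follows. This is only an illustration in plain Python, not WEKA code: the nearest-centroid classifier and the toy data are placeholders I made up, and `cross_validate` accepts any pair of train/predict functions.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, train_fn, predict_fn, k=5, seed=0):
    """Return the misclassification error (MCE) measured on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict_fn(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return errors

# Placeholder classifier (an assumption for the demo): nearest class centroid.
def train_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, row):
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))

# Toy data: two well-separated Gaussian blobs.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

errors = cross_validate(X, y, train_centroids, predict_centroid, k=5)
print("MCE per fold:", errors)
print("mean:", statistics.mean(errors), "std dev:", statistics.stdev(errors))
```

The spread of the per-fold errors is the "variability" mentioned above: a large standard deviation relative to the mean suggests the result depends heavily on which samples ended up in which fold.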
Independently of the linearity assumption, it is useful to normalize the values of numeric features. This helps with features which have an odd range etc.
Within each dimension, establish the range within, say, 2.5 standard deviations on either side of the median, and convert the feature values to a percentage on the basis of this range.
Convert nominal attributes to binary ones, creating as many dimensions as there are distinct values of the nominal attribute. (I think many algorithm optimizers will do this for you.)
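A minimal sketch of both preprocessing steps, in plain Python. The function names are mine, and clipping values that fall outside the 2.5 standard deviation range to 0 or 100 is an assumption: the advice above does not say what to do with outliers.

```python
import statistics

def scale_to_range(values, n_std=2.5):
    """Map values to 0..100 over [median - n_std*sd, median + n_std*sd].

    Values outside the range are clipped (an assumption, see above).
    """
    center = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [50.0 for _ in values]  # constant feature: put everything mid-range
    low = center - n_std * spread
    width = 2 * n_std * spread
    return [min(100.0, max(0.0, 100.0 * (v - low) / width)) for v in values]

def one_hot(values):
    """Turn a nominal attribute into one binary dimension per distinct value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

print(scale_to_range([1, 2, 3]))
print(one_hot(["red", "green", "red"]))
```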
Once you have identified one or a few classifiers with relatively decent performance (say 33% MCE), run the same test series with such a classifier, modifying only one parameter at a time. For example, remove some features and see whether the resulting lower-dimensionality classifier improves or degrades.
The loss factor is a very sensitive parameter. Try to stick with one "reasonable" but possibly suboptimal value for the bulk of the tests, and fine-tune the loss at the end.
Learn to exploit the "dump" info provided by the SVM optimizers. These results provide very valuable info as to what the optimizer "thinks".
Remember that what worked very well with a given dataset in a given domain may perform very poorly with data from another domain...
coffee's good, not too much. When all fails, make it Irish ;-)

Wow, so you have some training data and you don't know whether you are looking at features representing words in a document or genes in a cell, and you need to tune a classifier. Well, since you don't have any semantic information, you are going to have to do this solely by looking at statistical properties of the data sets.
First, to formulate the problem: this is more than just linear vs non-linear. If you are really looking to classify this data, what you really need to do is select a kernel function for the classifier, which may be linear or non-linear (gaussian, polynomial, hyperbolic, etc.). In addition, each kernel function may take one or more parameters that need to be set. Determining an optimal kernel function and parameter set for a given classification problem is not really a solved problem; there are only useful heuristics, and if you google 'selecting a kernel function' or 'choose kernel function', you will be treated to many research papers proposing and testing various approaches. While there are many approaches, one of the most basic and well travelled is to do a gradient descent on the parameters: basically, you try a kernel method and a parameter set, train on half your data points and see how you do on the other half. Then you try a different set of parameters and see how you do. You move the parameters in the direction of best improvement in accuracy until you get satisfactory results.
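Here is a rough sketch of that try-and-move loop for a single parameter. Everything in it is an illustrative assumption: the classifier is a simple kernel-weighted vote (not an SVM), the only parameter is the RBF width `gamma`, and the data is a made-up ring around a blob. Because accuracy is not differentiable, the loop is a hill climb rather than a true gradient descent, and it can stall on a plateau.

```python
import math
import random

def rbf(a, b, gamma):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(train_X, train_y, row, gamma):
    """Pick the class whose training points have the largest summed kernel value."""
    scores = {}
    for x, label in zip(train_X, train_y):
        scores[label] = scores.get(label, 0.0) + rbf(x, row, gamma)
    return max(scores, key=scores.get)

def accuracy(train_X, train_y, test_X, test_y, gamma):
    hits = sum(predict(train_X, train_y, row, gamma) == label
               for row, label in zip(test_X, test_y))
    return hits / len(test_y)

def hill_climb_gamma(train_X, train_y, test_X, test_y, gamma, factor=2.0, max_steps=20):
    """Move gamma up or down by `factor` while held-out accuracy keeps improving."""
    best = accuracy(train_X, train_y, test_X, test_y, gamma)
    for _ in range(max_steps):
        trials = [(accuracy(train_X, train_y, test_X, test_y, g), g)
                  for g in (gamma * factor, gamma / factor)]
        trial_acc, trial_gamma = max(trials)
        if trial_acc <= best:
            break  # no neighbouring value is better: stop here
        best, gamma = trial_acc, trial_gamma
    return gamma, best

# Toy data: class 1 forms a ring around class 0, so it is not linearly separable.
rng = random.Random(0)

def ring_point(radius):
    angle = rng.uniform(0, 2 * math.pi)
    r = radius + rng.gauss(0, 0.2)
    return [r * math.cos(angle), r * math.sin(angle)]

X = [ring_point(0.5) for _ in range(40)] + [ring_point(3.0) for _ in range(40)]
y = [0] * 40 + [1] * 40
order = list(range(len(X)))
rng.shuffle(order)
half = len(order) // 2
train_X, train_y = [X[i] for i in order[:half]], [y[i] for i in order[:half]]
test_X, test_y = [X[i] for i in order[half:]], [y[i] for i in order[half:]]

start_acc = accuracy(train_X, train_y, test_X, test_y, 0.1)
best_gamma, best_acc = hill_climb_gamma(train_X, train_y, test_X, test_y, gamma=0.1)
print("start accuracy:", start_acc)
print("best gamma:", best_gamma, "accuracy:", best_acc)
```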
If you don't want to go through all this complexity to find a good kernel function, and simply want an answer to linear or non-linear, then the question mainly comes down to two things. Non-linear classifiers have a higher risk of overfitting (undergeneralizing), since they have more degrees of freedom. They can end up merely memorizing sets of good data points rather than coming up with a good generalization. On the other hand, a linear classifier has less freedom to fit and, in the case of data that is not linearly separable, will fail to find a good decision function and suffer from high error rates.
Unfortunately, I don't know a better mathematical solution to answer the question "is this data linearly separable" other than to just try the classifier itself and see how it performs. For that you are going to need a smarter answer than mine.
Edit: This research paper describes an algorithm which looks like it should be able to determine how close a given data set comes to being linearly separable.
http://www2.ift.ulaval.ca/~mmarchand/publications/wcnn93aa.pdf
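As a concrete version of "just try the classifier and see", here is a small perceptron check in plain Python. It is not the algorithm from the linked paper, only a sketch: a perceptron is guaranteed to reach zero training errors if the data is linearly separable, so success proves separability, while failure within `max_epochs` (an arbitrary budget I picked) only suggests the data is not separable, or is separable with a very small margin.

```python
def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors within max_epochs.

    Labels in y must be -1 or +1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for row, label in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, row)) + b
            if label * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + label * xi for wi, xi in zip(w, row)]
                b += label
                mistakes += 1
        if mistakes == 0:
            return True
    return False

points = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [-1, -1, -1, 1]   # logical AND: separable by a line
xor_labels = [-1, 1, 1, -1]    # logical XOR: no separating line exists

print(perceptron_separates(points, and_labels))
print(perceptron_separates(points, xor_labels))
```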

Related

Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is: does it matter which type of algorithm is used, i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web.
Thank you
Yes, (depending on your task) it can matter quite a lot which algorithm you choose.
You can also be sure that the mice developers wouldn't put effort into providing different algorithms if there were one algorithm that always performs best anyway. Because, of course, as in machine learning, the "No free lunch theorem" is also relevant for imputation.
In general you can say that the default settings of mice are often a good choice.
Look at this example from the miceRanger Vignette to see, how far imputations can differ for different algorithms. (the real distribution is marked in red, the respective multiple imputations in black)
The Predictive Mean Matching (pmm) algorithm, for example, makes sure that the only imputed values that appear are ones that were really in the dataset. This is useful, for example, where only integer values like 0, 1, 2, 3 appear in the data (and no values in between). Other algorithms won't do this, so while doing their regression they will also provide interpolated values like in the picture to the right (so they will provide imputations that are e.g. 1.1, 1.3, ...). Both solutions can come with certain drawbacks.
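To make the pmm idea concrete, here is a stripped-down sketch in plain Python (not the mice implementation, and not R). It assumes a single predictor `xs` and a target `ys` with `None` for missing entries; the function names and data are made up. As far as I know, the real pmm in mice additionally draws the regression parameters at random to reflect uncertainty, which this sketch leaves out.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pmm_impute(xs, ys, donors=3, seed=0):
    """Fill missing ys with observed values from cases with similar predictions."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Pair each observed case's predicted value with its real observed value.
    predicted = [(slope * x + intercept, y) for x, y in observed]
    result = []
    for x, y in zip(xs, ys):
        if y is not None:
            result.append(y)
            continue
        target = slope * x + intercept
        nearest = sorted(predicted, key=lambda p: abs(p[0] - target))[:donors]
        result.append(rng.choice(nearest)[1])  # borrow a real observed value
    return result

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 1, 1, None, 2, 3, None, 3]
imputed = pmm_impute(xs, ys)
print(imputed)
```

Every imputed entry is copied from an observed case, so the result can only contain values that already occur in `ys`, which is the property described above.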
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.

How does window affect accuracy of skip-gram?

I would like to know how the window_size of the skip-gram model affects the accuracy of predicting similar words in a word embedding. In what cases can the accuracy drop or rise?
Thanks.
While the training process tries to incrementally improve a model's relative activation for a center target word, after being input nearby words, the exact prediction accuracy is never evaluated during training, nor is a model with better accuracy necessarily "better" than one with a lower accuracy.
Training as if you wanted to predict the center word is just a useful way of creating sets of word-vectors with useful relative arrangements.
So for most uses of word2vec, your question isn't really relevant. You shouldn't pick the window value that leads to the best accuracy on the word2vec internal training task; you should pick a window that gives the best word-vectors for whatever downstream use you're considering.
(If for some academic/curiosity reason, you needed to know the answer to your question, you'd need to run an experiment: try many window values, then ask the model to predict words, and compare the accuracy. Note, though, that many word2vec implementations don't even provide an API for individual word-prediction, as that's not needed for training nor most downstream uses.)
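For reference, this is roughly what the window controls: which (center, context) word pairs the model gets to train on. The sketch below is plain Python with a made-up function name, and uses a fixed window; real implementations (the original word2vec code, for example) also randomly shrink the effective window at each position, which is left out here.

```python
def skipgram_pairs(tokens, window):
    """List every (center, context) pair within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
narrow = skipgram_pairs(tokens, window=1)
wide = skipgram_pairs(tokens, window=2)
print(len(narrow), len(wide))
```

A larger window yields more pairs, including pairs of words that are further apart, which changes the kind of similarity the resulting vectors capture.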

How to interpret the discriminator's loss and the generator's loss in Generative Adversarial Nets?

I am reading people's implementation of DCGAN, especially this one in tensorflow.
In that implementation, the author draws the losses of the discriminator and of the generator, which is shown below (images come from https://github.com/carpedm20/DCGAN-tensorflow):
Neither the discriminator's loss nor the generator's loss seems to follow any pattern, unlike general neural networks, whose loss decreases as the number of training iterations increases. How should the loss be interpreted when training GANs?
Unfortunately, like you've said, for GANs the losses are very non-intuitive. Mostly it comes down to the fact that the generator and discriminator are competing against each other, hence an improvement in one means a higher loss for the other, until that other learns better from the received loss, which screws up its competitor, etc.
Now one thing that should happen often enough (depending on your data and initialisation) is that both discriminator and generator losses are converging to some permanent numbers, like this:
(it's ok for loss to bounce around a bit - it's just the evidence of the model trying to improve itself)
This loss convergence would normally signify that the GAN model found some optimum, where it can't improve more, which also should mean that it has learned well enough. (Also note, that the numbers themselves usually aren't very informative.)
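As a rough reference point only: in the idealised equilibrium where the discriminator cannot tell real from fake, it outputs 0.5 for everything, and the loss values that follow can be computed directly. The sketch assumes the standard binary cross-entropy formulation with the non-saturating generator loss, and sums the real and fake halves of the discriminator loss; implementations that average the two halves, or use other loss functions, will show different numbers, and real runs hover rather than settle exactly.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for a single prediction in (0, 1)."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))

d_out = 0.5  # discriminator output when it cannot distinguish real from fake

# Discriminator: real samples are labelled 1, generated samples 0.
d_loss = bce(d_out, 1) + bce(d_out, 0)
# Generator (non-saturating form): wants its samples to be labelled 1.
g_loss = bce(d_out, 1)

print("discriminator loss:", d_loss)  # 2 * ln(2)
print("generator loss:", g_loss)      # ln(2)
```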
Here are a few side notes, that I hope would be of help:
if the loss hasn't converged very well, it doesn't necessarily mean that the model hasn't learned anything - check the generated examples, sometimes they come out good enough. Alternatively, you can try changing the learning rate and other parameters.
if the model converged well, still check the generated examples - sometimes the generator finds one or a few examples that the discriminator can't distinguish from the genuine data. The trouble is that it always gives out these few, not creating anything new; this is called mode collapse. Usually introducing some diversity to your data helps.
as vanilla GANs are rather unstable, I'd suggest using some version of the DCGAN models, as they contain some features like convolutional layers and batch normalisation that are supposed to help with the stability of the convergence. (the picture above is a result of the DCGAN rather than vanilla GAN)
This is common sense, but still: as with most neural net structures, tweaking the model, i.e. changing its parameters and/or architecture to fit your particular needs/data, can improve the model or screw it up.

Combining an image classifier and an expert system

Would it be accurate to include an expert system in an image classifying application? (I am working with Matlab, have some experience with image processing and no experience with expert systems.)
What I'm planning on doing is adding an extra feature vector that is actually an answer to a question. Is this fine?
For example: Assume I have two questions that I want the answers to : Question 1 and Question 2. Knowing the answers to these 2 questions should help classify the test image more accurately. I understand expert systems are coded differently from an image classifier but my question is would it be wrong to include the answers to these 2 questions, in a numerical form (1 can be yes, and 0 can be no) and pass this information along with the other feature vectors into a classifier.
If it matters, my current classifier is an SVM.
Regarding training images: yes, they too will be trained with the 2 extra feature vectors.
Converting a set of comments to an answer:
A similar question on Cross Validated already explains that it can be done as long as the data is properly preprocessed.
In short: you can combine them as long as the training (and testing) data is properly preprocessed (e.g. standardized). Standardization improves the performance of most linear classifiers because it scales the variables so they have similar weight in the learning process, and it improves numerical stability (and performance) when variables are sampled from gaussian-like distributions (standardization gives them zero mean and unit variance).
With that, if continuous variables are standardized and categorical variables are encoded as (-1, +1), the SVM should work well. Whether or not it will improve the performance of the classifier depends on the quality of those categorical variables.
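A small sketch of that preprocessing in plain Python (not Matlab; the function names and the tiny example data are made up). The scaling statistics are learned from the training rows only and then reused for the test rows, which is the usual practice and an assumption on my part about how you would apply it.

```python
import statistics

def fit_scaler(rows):
    """Learn the per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    spreads = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, spreads

def transform(rows, answers, scaler):
    """Standardize the continuous features and append the answers as -1 / +1."""
    means, spreads = scaler
    combined = []
    for row, answer_row in zip(rows, answers):
        scaled = [(v - m) / s for v, m, s in zip(row, means, spreads)]
        encoded = [1.0 if a else -1.0 for a in answer_row]
        combined.append(scaled + encoded)
    return combined

# Two image feature values per sample, plus yes/no answers to Question 1 and 2.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
train_answers = [[True, False], [False, True]]

scaler = fit_scaler(train_rows)
train_features = transform(train_rows, train_answers, scaler)
test_features = transform([[2.0, 20.0]], [[True, True]], scaler)
print(train_features)
print(test_features)
```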
Answering the other question in the comment: when using a kernel SVM with, for example, a chi-square kernel, the rows of the training data are supposed to behave like histograms (all positive and usually L1-normalized), and therefore introducing a (-1, +1) feature breaks the kernel. With an RBF kernel the rows of the data are supposed to be L2-normalized, and again, introducing (-1, +1) features might introduce unexpected behaviour (I'm not very sure what exactly the effect would be...).
I worked on a similar problem. If multiple features can be extracted from your images, then you can train different classifiers using different features. You can think of these classifiers as experts in answering questions based on the features they used in training. Instead of using labels as outputs, it is better to use confidence values; uncertainty can be very important in this respect. You can use these experts to generate values, and these values can be combined and used to train another classifier.

How to improve accuracy of decision tree in matlab

I have a set of data which I classify in matlab using a decision tree. I divide the set into two parts: training data (85%) and test data (15%). The problem is that the accuracy is around 90% and I do not know how I can improve it. I would appreciate any ideas about it.
Decision trees might perform poorly for many reasons. One prominent reason I can think of is that, while calculating a split, they do not consider the inter-dependency of variables or the dependence of the target variable on other variables.
Before trying to improve the performance, one should be aware that the improvement should not cause over-fitting and that the model should still be able to generalize.
To improve performance these few things can be done:
Variable preselection: Different tests can be done, like a multicollinearity test, VIF calculation, or IV calculation on the variables, to select only a few top variables. This can lead to improved performance, as it cuts out the undesired variables.
Ensemble Learning: Use multiple trees (random forests) to predict the outcomes. Random forests in general perform better than a single decision tree, as averaging many trees mainly reduces variance. They are less prone to overfitting as well.
K-Fold cross validation: Cross validation within the training data gives a more reliable performance estimate and lets you tune the tree's parameters, which can improve the model a bit.
Hybrid Model: Use a hybrid model, i.e. use logistic regression after using decision trees to improve performance.
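To illustrate the ensemble idea from the list above, here is a heavily simplified sketch in plain Python rather than Matlab: bagged decision stumps (one-split trees) with a random feature subset per stump and a majority vote. A real random forest grows full trees (in Matlab you would more likely reach for something like TreeBagger), so treat the names and the toy data as illustrative assumptions.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """Find the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = (sum(l != left_label for l in left)
                      + sum(r != right_label for r in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        features = rng.sample(range(n_features), max(1, n_features // 2))
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], features))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return majority([predict_stump(stump, row) for stump in forest])

# Toy data: the class flips once the first feature reaches 5.
X = [[i, 2 * i] for i in range(10)]
y = [0] * 5 + [1] * 5
forest = fit_forest(X, y)
print(predict_forest(forest, [0, 0]), predict_forest(forest, [9, 18]))
```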
I guess the more important question here is what's a good accuracy for the given domain: if you're classifying spam then 90% might be a bit low, but if you're predicting stock prices then 90% is really high!
If you're doing this on a known domain set and there are previous examples of classification accuracy which is higher than yours, then you can try several things:
K-Fold Cross Validation
Ensemble Learning
Generalized Iterative Scaling (GIS)
Logistic Regression
I don't think you should improve this; maybe the classifier is overfitting the data. Try using other data sets, or cross-validation, to get a more accurate result.
By the way, 90%, if not overfitted, is a great result; maybe you don't even need to improve it.
You could look into pruning the leaves to improve the generalization of the decision tree. But as was mentioned, 90% accuracy can be considered quite good.
Whether 90% is good or bad depends on the domain of the data.
However, it might be that the classes in your data are overlapping and you can't really do more than 90%.
You can try to see in which nodes the errors occur, and check whether it's possible to improve the classification by changing them.
You can also try Random Forest.