why as number of features increases, the classification accuracy decreases when using svm - matlab

I am using libsvm for image classification. Why when I use more features for classification my prediction accuracy decreases? Shouldn´t it increase? My dataset size is fixed at 1600 for training and 400 for testing.

Because the additional features may not be at all useful for separating the classes in the feature space. Accuracy is not necessarily tied to number of features.
Including lots of poor features may cause your SVM to learn the noise in the data, damaging the accuracy.
For example, if your extra feature looks like this (using 2D plots for clarity):
Then it will not be a very good feature for separating the (in this case) two classes. If for example, the SVM trains only on this pattern, it will not be very good at predicting the class of a future point. However, there might be a feature in your dataset that looks like this:
A feature like this would be very useful in separating the two classes.

Related

When to use PCA for dimensionality reduction?

I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion
I think training set 700 means, your data is < 1k.
I'm even surprised that svm performs 83%.
Even MNIST dataset is considered to be small (60.000 training - 10.000 test). Your data is much-much smaller.
You try to reduce your small data even smaller using pca. So what will svm learns? There is no discriminating samples left?
If I were you I would test using random-forest classifier. Random-forest might even perform better.
Even if you balanced your data, it is small data.
I believe using SMOTE will not improve the result. If your data consist of images then you could use ImageDataGenerator for replicating your data. Though I'm not sure matlab contains ImageDataGenerator.
You will use PCA, when you have lots of samples. Yet the samples are not directly effecting the accuracy but they are the components of data.
For instance: Let's consider handwritten digit classification data.
From above can we say each pixel is directly effecting the accuracy?
The answer is no? Above the black pixels are not important for the accuracy, therefore to remove them we use pca.
If you want a detailed explanation with a python example. Check out my other answer

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what should be an ideal training data size for doc2Vec model?
I will be sharing my understanding here. Please feel free to correct me/suggest changes-
Training on a general purpose dataset- If I want to use a model trained on a general purpose dataset, in a specific use case, I need to train on a lot of data.
Training on the context related dataset- If I want to train it on the data having the same context as my use case, usually the training data size can have a smaller size.
But what are the number of words used for training, in both these cases?
On a general note, we stop training a ML model, when the error graph reaches an "elbow point", where further training won't help significantly in decreasing error. Has any study being done in this direction- where doc2Vec model's training is stopped after reaching an elbow ?
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
Sometimes parameter tweaks can help get the most out of thin data – in particular, more training iterations or a smaller model (fewer vector-dimensions) can slightly offset some issues with small corpuses, sometimes. But the Word2Vec/Doc2Vec really benefit from lots of subtly-varied, domain-specific data - it's the constant, incremental tug-of-war between all the text-examples during training that helps the final representations settle into a useful constellation-of-arrangements, with the desired relative-distance/relative-direction properties.

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Can KNN be better than other classifiers?

As Known, there are classifiers that have a training or a learning step, like SVM or Random Forest. On the other hand, KNN does not have.
Can KNN be better than these classifiers?
If no, why?
If yes, when, how and why?
The main answer is yes, it can due to no free lunch theorem implications. FLT can be loosley stated as (in terms of classification)
There is no universal classifier which is consisntenly better at any task than others
It can also be (not very strictly) inverted
For each (well defined) classifier there exists a dataset where it is the best one
And in particular - kNN is well-defined classifier, in particular it is consistent with any distibution, which means that given infinitely many training points it converges to the optimal, Bayesian separator.
So can it be better than SVM or RF? Obviously! When? There is no clear answer. First of all in supervised learning you often actually get just one training set and try to fit the best model. In such scenario any model can be the best one. When statisticians/theoretical ML try to answer whether one model is better than another, we actually try to test "what would happen if we would have ifinitely many training sets" - so we look at the expected value of the behaviour of the classifiers. In such setting, we often show that SVM/RF is better than KNN. But it does not mean that they are always better. It only means, that for randomly selected dataset you should expect KNN to work worse, but this is only probability. And as you can always win in a lottery (no matter the odds!) you can also always win with KNN (just to be clear - KNN has bigger chances of being a good model than winning a lottery :-)).
What are particular examples? Let us for example consider a rotated XOR problem.
If the true decision boundaries are as above, and you only have this four points. Obviously 1NN will be much better than SVM (with dot, poly or rbf kernel) or RF. It should also be true once you include more and more training points.
"In general kNN would not be expected to exceed SVM or RF. When kNN does, that says something very interesting about the training data. If many doublets are present i the data set, a nearest neighbor algorithm works very well."
I heard the argument something like as written by Claudia Perlich in this podcast:
http://www.thetalkingmachines.com/blog/2015/6/18/working-with-data-and-machine-learning-in-advertizing
My intuitive understanding of why RF and SVM is better kNN in generel: All algorithms basicly assume some local similarity, such that samples very alike gets classified alike. kNN can only choose the most similar samples by distance(or some other global kernel). So the samples which could influence a prediction on kNN would exists within a hyper sphere for the Euclidean distance kernel. RF and SVM can learn other definitions of locality which could stretch far by some features and short by others. Also the propagation of locality could take up many learned shapes, and these shapes can differ through out the feature space.

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements from the webpages. Once I detect those I`ll not be allowing those to be displayed on the client side.
Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset no. of attributes are very high. In fact one of the mentors of the project told me that If you train the Neural Network with that many attributes, it'll take lots of time to get trained. So is there a way to optimize the input dataset? Or I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The # of instances(3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
ANN is slow in training, I'd suggest you to use a logistic regression or svm. Both of them are very fast to train. Especially, svm has a lot of fast algorithms.
In this dataset, you are actually analyzing text, but not image. I think a linear family classifier, i.e. logistic regression or svm, is better for your job.
If you are using for production and you cannot use open source code. Logistic regression is very easy to implement compared to a good ANN and SVM.
If you decide to use logistic regression or SVM, I can future recommend some articles or source code for you to refer.
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (use standard statistical methods for this) Are some features redundant? (Principal component analysis is a good stating point for this.) Don't expect the artificial neural network to do that work for you.
Also: remeber Duda&Hart's famous "no-free-lunch-theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda&Hart&Storks's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
aplly a seperate ANN for each category of features
for example
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
then train all of them
use another main ANN to join results