I am using Weka's IBk for performing classification on text (tweets). I am converting the training and test data to vector space, and when I am performing the classification on test data, the best result comes from K=1. The training and testing data are separate from each other. Why does K=1 give the best accuracy?
Because you are using vectors; so at k=1 the value you get for proximity (for k=1) is more important than what the common class is in case of k=n (ex: when k=5)
Related
I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion
I think training set 700 means, your data is < 1k.
I'm even surprised that svm performs 83%.
Even MNIST dataset is considered to be small (60.000 training - 10.000 test). Your data is much-much smaller.
You try to reduce your small data even smaller using pca. So what will svm learns? There is no discriminating samples left?
If I were you I would test using random-forest classifier. Random-forest might even perform better.
Even if you balanced your data, it is small data.
I believe using SMOTE will not improve the result. If your data consist of images then you could use ImageDataGenerator for replicating your data. Though I'm not sure matlab contains ImageDataGenerator.
You will use PCA, when you have lots of samples. Yet the samples are not directly effecting the accuracy but they are the components of data.
For instance: Let's consider handwritten digit classification data.
From above can we say each pixel is directly effecting the accuracy?
The answer is no? Above the black pixels are not important for the accuracy, therefore to remove them we use pca.
If you want a detailed explanation with a python example. Check out my other answer
In Python I am working on a binary classification problem of Fraud detection on travel insurance. Here is the characteristic about my dataset:
Contains 40,000 samples with 20 features. After one hot encoding, the number of features is 50(4 numeric, 46 categorical).
Majority unlabeled: out of 40,000 samples, 33,000 samples are unlabeled.
Highly imbalanced: out of 7,000 labeled samples, only 800 samples(11%) are positive(Fraud).
Metrics is precision, recall and F2 score. We focus more on avoiding false positive, therefore high recall is appreciated. As preprocessing I oversampled positive cases using SMOTE-NC, which takes into account categorical variables as well.
After trying several approaches including Semi-Supervised Learning with Self Training and Label Propagation/Label Spreading etc, I achieved high recall score(80% on training, 65-70% on test). However, my precision score shows some trace of overfitting(60-70% on training, 10% on testing). I understand that precision is good on training because it's resampled, and low on test data because it directly reflects the imbalance of the classes in test data. But this precision score is unacceptably low so I want to solve it.
So to simplify the model I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD(Factor Analysis for Mixture Data).
Question 1: How I should do normalization, FAMD, k-fold Cross Validation and resampling? Is my approach below correct?
Question 2: The package prince does not have methods such as fit or transform like in Sklearn, so I cannot do the 3rd step described below. Any other good packages to do fitand transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach:
Make k folds and isolate one of them for validation, use the rest for training
Normalize training data and transform validation data
Fit FAMD on training data, and transform training and test data
Resample only training data using SMOTE-NC
Train whatever model it is, evaluate on validation data
Repeat 2-5 k times and take the average of precision, recall F2 score
*I would also appreciate for any kinds of advices on my overall approach to this problem
Thanks!
I am trying to classify the four groups of images using SVM method, by randomly selecting training and testing data each time. When T run the program the performance varies due to randomly selecting data. How to get accurate performance of my algorithm and also how to calculate training and testing accuracy?
The formula I am using for performance is
Performance = sum(PredictedLabels == test_labels) / numel(PredictedLabels)
I am using multisvm function for classification.
My suggestion:
Actually the performance measure is acceptable, though there are some other slightly better choices like #Dan has mentioned.
More importantly, you need to deal with randomness.
1) Everytime you select your training data, test the trained model with multiple randomized test data and average the accuracy. (e.g. 10 times or so)
2) Use multiple trained model and average the performance to get general performance.
Remark:
1) You need to make sure the training data and test data do not overlap. Or it is no longer test data.
2) It is better to have the training data have the same number of samples from each class label. This means you can partition your dataset in advance.
I'm relatively new to Matlab ANN Toolbox. I am training the NN with pattern recognition and target matrix of 3x8670 containing 1s and 0s, using one hidden layer, 40 neurons and the rest with default settings. When I get the simulated output for new set of inputs, then the values are around 0 and 1. I then arrange them in descending order and choose a fixed number(which is known to me) out of 8670 observations to be 1 and rest to be zero.
Every time I run the program, the first row of the simulated output always has close to 100% accuracy and the following rows dont exhibit the same kind of accuracy.
Is there a logical explanation in general? I understand that answering this query conclusively might require the understanding of program and problem, but its made of of several functions to clearly explain. Can I make some changes in the training to get consistence output?
If you have any suggestions please share it with me.
Thanks,
Nishant
Your problem statement is not clear for me. For example, what you mean by: "I then arrange them in descending order and choose a fixed number ..."
As I understand, you did not get appropriate output from your NN as compared to the real target. I mean, your output from NN is difference than target. If so, there are different possibilities which should be considered:
How do you divide training/test/validation sets for training phase? The most division should be assigned to training (around 75%) and rest for test/validation.
How is your training data set? Can it support most scenarios as you expected? If your trained data set is not somewhat similar to your test data sets (e.g., you have some new records/samples in the test data set which had not (near) appear in the training phase, it explains as 'outlier' and NN cannot work efficiently with these types of samples, so you need clustering approach not NN classification approach), your results from NN is out-of-range and NN cannot provide ideal accuracy as you need. NN is good for those data set training, where there is no very difference between training and test data sets. Otherwise, NN is not appropriate.
Sometimes you have an appropriate training data set, but the problem is training itself. In this condition, you need other types of NN, because feed-forward NNs such as MLP cannot work with compacted and not well-separated regions of data very well. You need strong function approximation such as RBF and SVM.
I want you to help me figure out which problem am I dealing with (pattern recognition or time series forecasting) and find the best NN architecture suited for this problem.
In my problem, I have many finite sets of two dimensional data (learning sets)
Lets N be the size of the data set I want to calculate using the NN.
I want my NN to learn these data and by giving it the first m data of the data set it gives me the remaining N-m data.
I think it's rather a pattern recognition problem, so which is the best NN architecture suited for this kind.
Thank you.
As far as I have understood you problem, you have a dataset with N rows. And you want to train your network using first M rows. And then you want your NN to predict the rest N-M rows.
Typically, in forecasting (timeseries prediction), we do this kind of stuffs. We train our model with historical data and try to predict future values.
So, in your case, top M rows could be training data in the training phase.And during the model accuracy evaluation phase, future values could be your N-M rows.
Typically, recurrent networks are best suited for temporal data, because, they can take care of ordered data.
ENCOG also provides a special dataset for temporal data.And you can use them for your problem.