How to find out which data sets destroy my data analysis using MATLAB? - matlab

I have 200 samples, each of them has 60 features. I use PCA to find the principal components. I use neural network and also try k nearest neighbor However, the classification results are not good. I don't mind to take out some samples, but how I can tell which samples destroy my classification results? I know I can try them one by one, but it would be very ineffective. Please help

Instead of throwing out some samples you need to throw out some attributes.
PCA computes a matrix with d x d entries. At 60 attributes, this matrix has 3600 entries. You have only 200 samples to compute the contents of this matrix - no wonder that the result is pretty much random. You need fewer variables and more data.

This is a classical machine learning problem. There is always a risk with such a high number of features (in your case 60) with only 200 samples. Please check whether you have features which are redundant. Let me give an example
Imagine, we have to predict housing prices from the following features
1. Size in m2
2. Number of bedrooms
3. House age
4. Size in foot2
Please note that here number 2 and number 4 features both gives the same information and they are redundant. At first it does not look that disturbing. But if you have data like that its better to remove those features.
Therefore, i would recommend you to look first in your features and then into data. For more details you have a look in Machine Learning class(by Prof. Ng) from Stanford available in coursera

Related

Neural Network - Working with a imbalanced dataset

I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4 , with label '0' appearing 250 times more than label '1'. However, I have a lot of training samples : around 23 millions. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether NN could be efficient to handle this kind of imbalanced dataset with a large dataset ?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha number. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model is predicting way too many 1 that it should be - around 0.50 of label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too as I did it manually for now : tf.nn.weighted_cross_entropy_with_logitsthat
2) I will try to de-unbalance the dataset but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~ 100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways for imbanlanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax+cross-entropy or Hinge Loss you can as #chasep255 mentionned make it more costly for the network to misclassify the example that appear the less.
To do that simply split the cost into two parts and put more weights on the class that have fewer examples.
For simplicity if you say that the dominant class is labelled negative (neg) for softmax and the other the positive (pos) (for Hinge you could exactly the same):
L=L_{neg}+L_{pos} =>L=L_{neg}+\alpha*L_{pos}
With \alpha greater than 1.
Which would translate in tensorflow for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0,1] to something like :
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
Whatismore by digging a bit into Tensorflow API you seem to have a tensorflow function tf.nn.weighted_cross_entropy_with_logitsthat implements it did not read the details but look fairly straightforward.
Another way if you train your algorithm with mini-batch SGD would be make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - neural network could help in your case. There are at least two approaches to such problem:
Leave your set not changed but decrease the size of batch and number of epochs. Apparently this might help better than keeping the batch size big. From my experience - in the beginning network is adjusting its weights to assign the most probable class to every example but after many epochs it will start to adjust itself to increase performance on all dataset. Using cross-entropy will give you additional information about probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during evaluation phase using Bayes rule:score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html

Estimate geo-coordinates based on tweets using sparse coding

I'm trying to estimate Geo-coordinates of tweets on Twitter based only on characteristics of tweets' content. I used an algorithm from this paper.
Basically, the tweets from users are collected and pre-processed to create a sequence/word-count vectors. Sub-vectors (patches) are extracted and an unsupervised learning method is used for learning Dictionary (KSVD). Using the learned Dictionary, sparse codes can be found. After that, a max pooling scheme is introduced. Finally, a look-up table is created with entries of key (sparse code)/value (Geo-coordinates). For estimating Geo-coordinates of tweets (same user), we calculate corresponding sparse codes, then use kNN to find the neighbors. The geo-coordinates can be estimated by the average value of these neighbor vectors.
Following is how I implement the algorithm:
I extracted data from GeoText dataset, separating data by user (say,80% of users for the training set, and 20% for the validation set)
Patches/sub-vectors are put together to create a big matrices of the training set and the validation set
For learning dictionary, I used KSVD-BOX from: http://www.cs.technion.ac.il/~ronrubin/software.html
For sparse coding, I used OMP-BOX from the same aforementioned website
Some necessary parameters:
N = 64 (dimension of patch or sub-vector)
K = 600 (number of atoms)
T = 10 (sparsity)
kNN = 30 (number of the nearest neighbours)
epsilon = 0.1 (whitening constant)
It can be seen that the algorithm runs pretty fast, taking 10 minutes for training, and 5 minutes for testing. However, I've never obtained a high accuracy. In fact, the mean value distance errors is always around 1000 km, which is not as good as in the paper (500km). I've followed every points in the paper, including enhancement options. Here is my Matlab source code.
Well, I know the description is pretty long, but I tried to explain what I understand in an easy way. I hope you can help me improve the accuracy.
Thank for your patient.

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

I am trying to classify human activities in videos(six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT(both xy and t scale=1) from UCF.
for f= 1:20
f
offset = 0;
c=strcat('running',num2str(f),'.mat');
load(c)
pix=video3Dm;
% Generate descriptors at locations given by subs matrix
for i=1:100
reRun = 1;
while reRun == 1
loc = subs(i+offset,:);
fprintf(1,'Calculating keypoint at location (%d, %d, %d)\n',loc);
% Create a 3DSIFT descriptor at the given location
[keys{i} reRun] = Create_Descriptor(pix,1,1,loc(1),loc(2),loc(3));
if reRun == 1
offset = offset + 1;
end
end
end
fprintf(1,'\nFinished...\n%d points thrown out do to poor descriptive ability.\n',offset);
for t1=1:20
des(t1+((f-1)*100),:)=keys{1,t1}.ivec;
end
f
end
My approach is to first get 50 descriptors(of 640 dimension) for one video, and then perform bag of words with all descriptors(on 50*600= 30000 descriptors). After performing Kmeans(with 1000 k value)
idx1000=kmeans(double(total_des),1000);
I am getting 30k of length index vector. Then I am creating histogram signature of each video based on their index values in clusters. Then perform svmtrain(sum in matlab) on signetures(dim-600*1000).
Some potential problems-
1-I am generating random 300 points in 3D to calculate 50 descriptors on any 50 points from those points 300 points.
2- xy, and time scale values, by default they are "1".
3-Cluster numbers, I am not sure that k=1000 is enough for 30000x640 data.
4-svmtrain, I am using this matlab library.
NOTE: Everything is on MATLAB.
Your basic setup seems correct especially given that you are getting 85-95% accuracy. Now, it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters examining the results and repeating. I going to break this answer into two parts. Advice about bag of words features, and advice about SVM classifiers.
Tuning Bag of Words Features
You are using 50 3D SIFT Features per video from randomly selected points with a vocabulary of 1000 visual words. As you've already mentioned, the size of the vocabulary is one parameter you can adjust. So is the number of descriptors per video.
Let's say that each video is 60 frames long, (at 30 fps only 2 sec, but let's assume you are sampling at 1fps for a 1 minute video). That means you are capturing less than one descriptor per frame. That seems very low to me even with 3D descriptors especially if the locations are randomly chosen.
I would manually examine the points for which you are generating features. Do they appear be well distributed in both space and time? Are you capturing too much background? Ask yourself, would I be able to distinguish between actions given these features?
If you find that many of the selected points are uninformative, increasing the number of points may help. The kmeans clustering can make a few groups for uninformative outliers, and more points means you hopefully capture a few more informative points. You can also try other methods for selecting points. For example, you could use corner points.
You can also manually examine the points that are clustered together. What sorts of structures do the groups have in common? Are the clusters too mixed? That's usually a sign that you need a larger vocabulary.
Tuning SVMs
Using the Matlab SVM implementation or the Libsvm implementation should not make a difference. They are both the same method and have similar tuning options.
First off, you should really be using cross-validation to tune the SVM to avoid overfitting on your test set.
The most powerful parameter for the SVM is the kernel choice. In Matlab, there are five built in kernel options, and you can also define your own. The kernels also have parameters of their own. For example, the gaussian kernel has a scaling factor, sigma. Typically, you start off with a simple kernel and compare to more complex kernels. For example, start with linear, then test quadratic, cubic and gaussian. To compare, you can simply look at your mean cross-validation accuracy.
At this point, the last option is to look at individual instances that are misclassified and try to identify reasons that they may be more difficult than others. Are there commonalities such as occlusion? Also look directly at the visual words that were selected for these instances. You may find something you overlooked when you were tuning your features.
Good luck!

Using cross-validation to find the right value of k for the k-nearest-neighbor classifier

I am working on a UCI data set about wine quality. I have applied multiple classifiers and k-nearest neighbor is one of them. I was wondering if there is a way to find the exact value of k for nearest neighbor using 5-fold cross validation. And if yes, how do I apply that? And how can I get the depth of a decision tree using 5-fold CV?
Thanks!
I assume here that you mean the value of k that returns the lowest error in your wine quality model.
I find that a good k can depend on your data. Sparse data might prefer a lower k whereas larger datasets might work well with a larger k. In most of my work, a k between 5 and 10 have been quite good for problems with a large number of cases.
Trial and Error can at times be the best tool here, but it shouldn't take too long to see a trend in the modelling error.
Hope this Helps!

Matlab and Support Vector Machines: Why doesn't the implementation of PCA give good prediction results?

I have a training dataset with 60,000 images and a testing dataset with 10,000 images. Each image represents an integer number from 0 to 9. My goal was to use libsvm which is a library for Support Vector Machines in order to learn the numbers from the training dataset and use the classification produced to predict the images of the testing dataset.
Each image is 28x28 which means that it has 784 pixels or features. While the features seem to be too many it took only 5-10 minutes to run the SVM application and learn the training dataset. The testing results were very good giving me 93% success rate.
I decided to try and use PCA from matlab in order to reduce the amount of features while at the same time not losing too much information.
[coeff scores latent] = princomp(train_images,'econ');
I played with the latent a little bit and found out that the first 90 features would have as a result 10% information loss so I decided to use only the first 90.
in the above code train_images is an array of size [60000x784]
from this code I get the scores and from the scores I simply took the number of features I wanted, so finally I had for the training images an array of [60000x90]
Question 1: What's the correct way to project the testing dataset to the coefficients => coeff?
I tried using the following:
test_images = test_images' * coeff;
Note that the test_images accordingly is an array of size [784x10000] while the coeff an array of size [784x784]
Then from that again I took only the 90 features by doing the following:
test_images = test_images(:,(1:number_of_features))';
which seemed to be correct. However after running the training and then the prediction, I got a 60% success rate which is way lower than the success rate I got when I didn't use any PCA at all.
Question 2: Why did I get such low results?
After PCA I scaled the data as always which is the correct thing to do I guess. Not scaling is generally not a good idea according to the libsvm website so I don't think that's an issue here.
Thank you in advance
Regarding your first question, I believe MarkV has already provided you with an answer.
As for the second question: PCA indeed conserves most of the variance of your data, but it does not necessarily means that it maintains 90% of the information of your data. Sometimes, the information required for successful classification is actually located at these 10% you knocked off. A good example for this can be found here, especially figure 1 there.
So, if you have nice results with the full features, why reduce the dimension?
You might want to try and play with different principal components. What happens if you take components 91:180 ? that might be an interesting experiment...