I'm trying to estimate the geo-coordinates of tweets on Twitter based only on the content of the tweets. I used an algorithm from this paper.
Basically, tweets from users are collected and pre-processed to create sequence/word-count vectors. Sub-vectors (patches) are extracted, and an unsupervised learning method (K-SVD) is used to learn a dictionary. Using the learned dictionary, sparse codes are computed, and a max-pooling scheme is then applied. Finally, a look-up table is built whose entries are key/value pairs of sparse code and geo-coordinates. To estimate the geo-coordinates of new tweets (from the same user), we compute the corresponding sparse codes and use kNN to find their neighbors in the table. The estimate is the average of the geo-coordinates of these neighbors.
Here is how I implemented the algorithm:
I extracted data from the GeoText dataset and split it by user (say, 80% of users for the training set and 20% for the validation set).
Patches/sub-vectors are stacked into one big matrix for the training set and one for the validation set.
For dictionary learning, I used KSVD-BOX from: http://www.cs.technion.ac.il/~ronrubin/software.html
For sparse coding, I used OMP-BOX from the same website.
Some necessary parameters:
N = 64 (dimension of patch or sub-vector)
K = 600 (number of atoms)
T = 10 (sparsity)
kNN = 30 (number of the nearest neighbours)
epsilon = 0.1 (whitening constant)
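For reference, the final look-up step as I understand it looks roughly like this. It is a Python sketch rather than my actual Matlab code: the table entries and coordinates are made up, `knn_estimate` and `haversine_km` are just illustrative names, and I assume plain Euclidean distance between codes and a plain arithmetic mean of latitude/longitude.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def knn_estimate(query_code, table, k):
    """Average the coordinates of the k table entries closest to query_code.

    table is a list of (code, (lat, lon)) pairs. Closeness is squared
    Euclidean distance between code vectors (an assumption of this sketch).
    A plain mean of lat/lon is naive near the poles or the antimeridian.
    """
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    nearest = sorted(table, key=lambda entry: sq_dist(entry[0], query_code))[:k]
    lat = sum(coord[0] for _, coord in nearest) / len(nearest)
    lon = sum(coord[1] for _, coord in nearest) / len(nearest)
    return lat, lon

# Toy look-up table: 3-dimensional codes with made-up coordinates.
table = [
    ([1.0, 0.0, 0.0], (40.0, -74.0)),
    ([0.9, 0.1, 0.0], (42.0, -72.0)),
    ([0.0, 1.0, 0.0], (34.0, -118.0)),
    ([0.0, 0.9, 0.1], (36.0, -120.0)),
]

estimate = knn_estimate([0.95, 0.05, 0.0], table, k=2)
error_km = haversine_km(estimate, (41.0, -73.0))  # distance to the "true" location
```

The mean distance error I report is the average of `error_km` over all validation tweets.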
The algorithm runs fairly fast: about 10 minutes for training and 5 minutes for testing. However, I have never obtained high accuracy. The mean distance error is always around 1000 km, which is well short of the result in the paper (500 km). I have followed every point in the paper, including the enhancement options. Here is my Matlab source code.
I know the description is pretty long, but I tried to explain what I understand as simply as I can. I hope you can help me improve the accuracy.
Thanks for your patience.
I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion.
A training set of 700 means you have fewer than 1,000 samples.
I'm even surprised that the SVM reaches 83%.
Even the MNIST dataset is considered small (60,000 training and 10,000 test images). Your data is much, much smaller.
You are trying to make your small data even smaller with PCA. So what will the SVM learn? There may be no discriminating information left.
If I were you, I would test a random forest classifier. A random forest might even perform better.
Even if you balance your data, it is still a small dataset.
I believe using SMOTE will not improve the result. If your data consists of images, you could use ImageDataGenerator to augment it, though I'm not sure MATLAB offers ImageDataGenerator.
PCA is useful when you have lots of features, many of which do not directly affect the accuracy even though they are part of the data.
For instance, consider handwritten digit classification data.
Looking at a digit image, can we say each pixel directly affects the accuracy?
The answer is no. The black background pixels are not important for the accuracy, so we use PCA to remove them.
If you want a detailed explanation with a Python example, check out my other answer.
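A rough way to see this is to check how much each pixel varies across the images. The sketch below is a simplified stand-in for PCA, not PCA itself: PCA works on directions of variance rather than on single pixels, but a pixel that is black in every image has zero variance and contributes to no component. The toy "images" and the function name are made up for illustration.

```python
def column_variances(rows):
    """Population variance of every feature (column) across the samples."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    return [sum((x - m) ** 2 for x in col) / n
            for col, m in zip(columns, means)]

# Four toy "images" of 5 pixels each; pixels 0 and 4 are always black (0).
images = [
    [0, 3, 9, 1, 0],
    [0, 7, 2, 1, 0],
    [0, 5, 4, 8, 0],
    [0, 1, 6, 2, 0],
]

variances = column_variances(images)
informative = [i for i, v in enumerate(variances) if v > 0]   # pixels that change
reduced = [[row[i] for i in informative] for row in images]   # drop constant pixels
```

Here only the three middle pixels survive, which is the kind of reduction you hope for on image data. With 9 tabular features there is far less of this "dead" information to remove.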
I am using https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement to evaluate my clustering accuracy in MATLAB. It provides accuracy and rand_index, and the performance looks as expected. However, when I use NMI as a metric, the clustering performance is extremely low. I am using this source code: https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information.
I have two Nx1 vectors as inputs: one holds the actual labels and the other the cluster assignments. I checked every element, and found that even though the accuracy is 82% (with a rand_index of 0.70), the NMI is only 0.3209. Below is an example for the Iris dataset https://archive.ics.uci.edu/ml/datasets/iris with MATLAB's built-in k-means.
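data_dim = size(iris,2) - 1; % assumption: every column except the last (label) column is a feature
data = iris(:,1:data_dim);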
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);
NMI = nmi(actual_label,result_label);
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');
The result:
Auto ACC: 0.820000
Rand_Index: 0.701818
NMI: 0.320912
The Rand Index is not corrected for chance. Most pairs of points land in different clusters under both labelings, and every such pair counts as an agreement, so even random clusterings score well above 0, and the score drifts towards 1 as the number of clusters grows. You should therefore never really expect to see small values of Rand.
At the same time, accuracy can be high when all of your points fall into the same large cluster.
I have a feeling that the NMI is producing the more reliable comparison. To verify, try running a dimensionality reduction and plotting the data points colored by each of the two clusterings. Visualizations are often the best way to develop an intuition about data.
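To see the difference between the two metrics on a tiny case, here is a minimal Python sketch. The function names are mine, and I assume the NMI normalization I / sqrt(H(U) * H(V)), which is one common choice and may not match the File Exchange code exactly.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in joint.items())
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0

truth     = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabeled = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same partition, different label names
unrelated = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # carries no information about truth

perfect = (rand_index(truth, relabeled), nmi(truth, relabeled))
chance = (rand_index(truth, unrelated), nmi(truth, unrelated))
```

For the unrelated labeling the Rand Index is still 0.5 while the NMI is 0, which is why a respectable Rand value and a low NMI can coexist.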
If you want to explore more, a convenient python package for clustering comparisons is CluSim.
I am trying to classify human activities in videos (six classes and almost 100 videos per class, 6*100 = 600 videos). I am using the 3D SIFT code from UCF (both xy and t scale = 1).
for f = 1:20
    offset = 0;
    % Generate descriptors at locations given by subs matrix
    for i = 1:100
        reRun = 1;
        while reRun == 1
            loc = subs(i+offset,:);
            fprintf(1,'Calculating keypoint at location (%d, %d, %d)\n',loc);
            % Create a 3DSIFT descriptor at the given location
            [keys{i}, reRun] = Create_Descriptor(pix,1,1,loc(1),loc(2),loc(3));
            if reRun == 1
                offset = offset + 1; % descriptor rejected, try the next location
            end
        end
    end
    fprintf(1,'\nFinished...\n%d points thrown out do to poor descriptive ability.\n',offset);
    for t1 = 1:20
        % (the body of this loop is not included in the post)
    end
end
My approach is to first get 50 descriptors (each of dimension 640) per video, and then build a bag of words from all descriptors (50*600 = 30000 descriptors). After running k-means (with k = 1000),
I get an index vector of length 30000. I then create a histogram signature for each video from the cluster indices of its descriptors, and run svmtrain (SVM in MATLAB) on the signatures (dimension 600x1000).
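The signature step, as I understand it, looks roughly like the Python sketch below (not my MATLAB code). The function and variable names are made up for illustration, indices are 0-based here, and I assume each histogram is normalized to sum to 1.

```python
from collections import Counter

def bow_histograms(cluster_index, video_of_descriptor, n_videos, n_words):
    """Turn per-descriptor cluster indices into one histogram per video.

    cluster_index[d] is the visual word assigned to descriptor d and
    video_of_descriptor[d] is the video that descriptor came from.
    """
    counts = [Counter() for _ in range(n_videos)]
    for word, video in zip(cluster_index, video_of_descriptor):
        counts[video][word] += 1
    hists = []
    for c in counts:
        total = sum(c.values())
        # Normalize so videos with different descriptor counts are comparable.
        hists.append([c[w] / total if total else 0.0 for w in range(n_words)])
    return hists

# Toy case: 2 videos, 3 descriptors each, vocabulary of 4 visual words.
cluster_index = [0, 2, 2, 1, 1, 3]
video_of_descriptor = [0, 0, 0, 1, 1, 1]
signatures = bow_histograms(cluster_index, video_of_descriptor,
                            n_videos=2, n_words=4)
```

In my case this would give a 600x1000 matrix, one row per video, which is what goes into the SVM.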
Some potential problems:
1- I generate 300 random points in 3D and compute 50 descriptors at 50 of those 300 points.
2- The xy and time scale values are left at their default of 1.
3- The number of clusters: I am not sure that k = 1000 is enough for 30000x640 data.
4- svmtrain: I am using this MATLAB library.
NOTE: Everything is in MATLAB.
Your basic setup seems correct, especially given that you are getting 85-95% accuracy. Now it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters, examining the results, and repeating. I am going to break this answer into two parts: advice about bag of words features, and advice about SVM classifiers.
Tuning Bag of Words Features
You are using 50 3D SIFT Features per video from randomly selected points with a vocabulary of 1000 visual words. As you've already mentioned, the size of the vocabulary is one parameter you can adjust. So is the number of descriptors per video.
Let's say that each video is 60 frames long (at 30 fps that is only 2 seconds, but let's assume you are sampling a 1-minute video at 1 fps). That means you are capturing less than one descriptor per frame. That seems very low to me, even with 3D descriptors, especially if the locations are randomly chosen.
I would manually examine the points for which you are generating features. Do they appear to be well distributed in both space and time? Are you capturing too much background? Ask yourself: would I be able to distinguish between actions given these features?
If you find that many of the selected points are uninformative, increasing the number of points may help. The kmeans clustering can make a few groups for uninformative outliers, and more points means you hopefully capture a few more informative points. You can also try other methods for selecting points. For example, you could use corner points.
You can also manually examine the points that are clustered together. What sorts of structures do the groups have in common? Are the clusters too mixed? That's usually a sign that you need a larger vocabulary.
Tuning SVMs
Using the Matlab SVM implementation or the Libsvm implementation should not make a difference. They are both the same method and have similar tuning options.
First off, you should really be using cross-validation to tune the SVM to avoid overfitting on your test set.
The most powerful parameter for the SVM is the kernel choice. In Matlab, there are five built-in kernel options, and you can also define your own. The kernels also have parameters of their own. For example, the Gaussian kernel has a scaling factor, sigma. Typically, you start off with a simple kernel and compare it to more complex kernels. For example, start with linear, then test quadratic, cubic and Gaussian. To compare, you can simply look at the mean cross-validation accuracy of each.
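The comparison loop itself can be sketched as follows. This is a Python outline under assumptions, not MATLAB code: `k_fold_indices` and `cross_val_accuracy` are names I made up, and the classifier is a k-nearest-neighbour stand-in (NOT an SVM) so that the sketch runs without extra libraries. In practice you would plug in your SVM training and prediction calls and loop over kernels instead of `n_neighbors`.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=5, **params):
    """Mean held-out accuracy over k folds for one parameter setting."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)

# Stand-in classifier: k-nearest neighbours with a majority vote.
def fit_knn(X, y, n_neighbors=1):
    return X, y, n_neighbors

def predict_knn(model, x):
    X, y, n_neighbors = model
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    votes = Counter(y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes on a line.
X = [[0.1 * i] for i in range(10)] + [[5 + 0.1 * i] for i in range(10)]
y = [0] * 10 + [1] * 10

# One cross-validated score per candidate setting, then pick the best.
results = {n: cross_val_accuracy(fit_knn, predict_knn, X, y, k=5, n_neighbors=n)
           for n in (1, 3, 5)}
best_setting = max(results, key=results.get)
```

The important part is that every candidate setting is scored on held-out folds of the training data, so the test set is only touched once at the very end.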
At this point, the last option is to look at individual instances that are misclassified and try to identify reasons that they may be more difficult than others. Are there commonalities such as occlusion? Also look directly at the visual words that were selected for these instances. You may find something you overlooked when you were tuning your features.
Good luck!
I have a training dataset with 60,000 images and a testing dataset with 10,000 images. Each image represents an integer number from 0 to 9. My goal was to use libsvm which is a library for Support Vector Machines in order to learn the numbers from the training dataset and use the classification produced to predict the images of the testing dataset.
Each image is 28x28 which means that it has 784 pixels or features. While the features seem to be too many it took only 5-10 minutes to run the SVM application and learn the training dataset. The testing results were very good giving me 93% success rate.
I decided to try and use PCA from matlab in order to reduce the amount of features while at the same time not losing too much information.
[coeff scores latent] = princomp(train_images,'econ');
I played with latent a little and found that keeping the first 90 features would result in about 10% information loss, so I decided to use only the first 90.
In the above code, train_images is an array of size [60000x784].
From this code I get the scores, and from the scores I simply took the number of features I wanted, so in the end the training images were an array of size [60000x90].
Question 1: What's the correct way to project the testing dataset to the coefficients => coeff?
I tried using the following:
test_images = test_images' * coeff;
Note that test_images is an array of size [784x10000], while coeff is an array of size [784x784].
Then from that again I took only the 90 features by doing the following:
test_images = test_images(:,(1:number_of_features))';
which seemed to be correct. However after running the training and then the prediction, I got a 60% success rate which is way lower than the success rate I got when I didn't use any PCA at all.
Question 2: Why did I get such low results?
After PCA I scaled the data as always which is the correct thing to do I guess. Not scaling is generally not a good idea according to the libsvm website so I don't think that's an issue here.
Thank you in advance
Regarding your first question, I believe MarkV has already provided you with an answer.
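For completeness, and not necessarily in the form MarkV gave it: the usual recipe is to subtract the mean of the training set from the test data before multiplying by the coefficients, because princomp centers the training data before computing the scores. Below is a small Python sketch of that idea on 2-D toy data. The helper names are mine, and the closed-form 2-D PCA is only there to keep the example free of extra libraries.

```python
import math

def pca_2d(rows):
    """Return (mean, coeff) for 2-D data; coeff columns are the principal axes."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the first axis
    c, s = math.cos(theta), math.sin(theta)
    coeff = [[c, -s], [s, c]]  # column 0 = first component, column 1 = second
    return [mx, my], coeff

def project(rows, mean, coeff, n_components):
    """Center rows with the given mean, then keep the first n_components scores."""
    out = []
    for row in rows:
        centered = [v - m for v, m in zip(row, mean)]
        out.append([sum(centered[i] * coeff[i][j] for i in range(len(row)))
                    for j in range(n_components)])
    return out

train = [[2.0, 1.9], [4.0, 4.2], [6.0, 5.8], [8.0, 8.1]]
test = [[5.0, 5.0], [9.0, 9.0]]

train_mean, coeff = pca_2d(train)
train_scores = project(train, train_mean, coeff, n_components=1)
# The test set reuses the TRAINING mean and the TRAINING coefficients.
test_scores = project(test, train_mean, coeff, n_components=1)
```

If the test set is multiplied by coeff without this centering, its scores are shifted relative to the training scores, and any scaling applied afterwards no longer lines up between the two sets. That alone can cost a lot of accuracy.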
As for the second question: PCA does conserve most of the variance of your data, but that does not necessarily mean it keeps 90% of the information in your data. Sometimes the information required for successful classification sits in exactly the 10% you knocked off. A good example of this can be found here, especially figure 1 there.
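A toy illustration of the same point, as a Python sketch with made-up data: the two features are uncorrelated and zero-mean, so the principal components are simply the two coordinate axes and no eigen-decomposition is needed. The helper names are mine.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on one feature."""
    best = 0.0
    for t in values:
        for sign in (1, -1):
            pred = [1 if sign * (v - t) > 0 else 0 for v in values]
            acc = sum(p == l for p, l in zip(pred, labels)) / len(labels)
            best = max(best, acc)
    return best

# Feature 0 has a large spread but is unrelated to the class.
# Feature 1 has a tiny spread but separates the classes perfectly.
points = [(-10.0, 0.1), (-5.0, -0.1), (5.0, 0.1), (10.0, -0.1),
          (-10.0, -0.1), (-5.0, 0.1), (5.0, -0.1), (10.0, 0.1)]
labels = [1, 0, 1, 0, 0, 1, 0, 1]

f0 = [p[0] for p in points]
f1 = [p[1] for p in points]

explained_by_f0 = variance(f0) / (variance(f0) + variance(f1))
acc_high_variance = threshold_accuracy(f0, labels)  # the component PCA keeps
acc_low_variance = threshold_accuracy(f1, labels)   # the component PCA drops
```

Here the first component carries more than 99% of the variance and classifies at chance level, while the discarded component classifies perfectly.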
So, if you have nice results with the full features, why reduce the dimension?
You might want to play with different principal components. What happens if you take components 91:180? That might be an interesting experiment...
I have 200 samples, each with 60 features. I use PCA to find the principal components, and then try a neural network and k-nearest neighbors. However, the classification results are not good. I don't mind taking out some samples, but how can I tell which samples are ruining my classification results? I know I could try them one by one, but that would be very inefficient. Please help.
Instead of throwing out some samples you need to throw out some attributes.
PCA computes a matrix with d x d entries. At 60 attributes, this matrix has 3600 entries. You have only 200 samples to compute the contents of this matrix - no wonder that the result is pretty much random. You need fewer variables and more data.
This is a classical machine learning problem. There is always a risk with such a high number of features (in your case 60) and only 200 samples. Please check whether you have features which are redundant. Let me give an example.
Imagine we have to predict housing prices from the following features:
1. Size in m²
2. Number of bedrooms
3. House age
4. Size in ft²
Note that features number 1 and number 4 give the same information, so they are redundant. At first this does not look that disturbing, but if you have data like that it is better to remove one of those features.
Therefore, I would recommend looking first at your features and then at your data. For more details, have a look at the Machine Learning class (by Prof. Ng) from Stanford, available on Coursera.
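One simple way to spot this kind of redundancy is to look at pairwise correlations. Here is a minimal Python sketch with made-up housing numbers. The function names and the 0.95 threshold are my own choices for illustration, and correlation only catches linear redundancy.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def redundant_pairs(features, threshold=0.95):
    """Pairs of feature names whose absolute correlation reaches the threshold."""
    names = list(features)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) >= threshold]

size_m2 = [50.0, 80.0, 120.0, 65.0, 200.0]
features = {
    "size_m2": size_m2,
    "bedrooms": [1, 3, 2, 3, 4],
    "age_years": [30, 5, 12, 40, 8],
    "size_ft2": [s * 10.7639 for s in size_m2],  # same quantity, other unit
}

pairs = redundant_pairs(features)
```

Only the two size features are flagged, so one of them can be dropped without losing anything.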