I need to use KNN in MATLAB to find the training example closest to a new observation A.
I have a .mat file with training data of this form:
train_data = 1 232  34  21 0.542
             2  32 333 542 0.32
and so on.
Then I have a second piece of information, which I will gather through the application at runtime, and all I will get is a single row:
A = 2 343 543 43 0.23
So now my question is: is something like this all I need to do, and can I use it that way?
Does KNN need a training step, or do I only need to load the training data and some new data (like A) and run it through some formula? Or is it a two-step process, with one function that learns the model and a second function that gives you the result?
Best regards.
So you have a training set (with labels) and some test data without labels? I think you can use the function you linked to, ClassificationKNN(). If I understand your question correctly, you want something like this example: Predict Classification Based on a KNN Classifier
http://www.mathworks.se/help/stats/classification-using-nearest-neighbors.html#btap7nm
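Purely as an illustration of the workflow (not your MATLAB setup), here is a minimal sketch in Python with scikit-learn; the ID column, the label vector, and the data values are my assumptions, since KNN classification needs a label for every training row. The ClassificationKNN workflow in MATLAB is analogous: fit on the labelled training data, then predict on A.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical training data: column 1 is an ID, columns 2-5 are features
train_data = np.array([[1, 232,  34,  21, 0.542],
                       [2,  32, 333, 542, 0.32]])
labels = np.array([0, 1])             # assumed class label for each row

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_data[:, 1:], labels)    # "training" a KNN just stores the data

A = np.array([[343, 543, 43, 0.23]])  # the new observation, ID column dropped
print(knn.predict(A))                 # class of the closest training row

dist, idx = knn.kneighbors(A)         # or retrieve the nearest row itself
print(train_data[idx[0][0]])

So yes: one call to build (really just store) the model, and a second call to query it.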
For each input, I have an associated 49x2 matrix. Here's what one input-output pair looks like:
input :
[Car1, Car2, Car3 ..., Car118]
output :
[[Label1 Label2]
[Label1 Label2]
...
[Label1 Label2]]
Both Label1 and Label2 are label-encoded, and they have 1200 and 1300 different classes respectively.
Just to make sure: is this what we call a multi-output multi-class problem?
I tried flattening the output, but I feared the model wouldn't understand that all similar labels share the same classes.
Is there a Keras layer that handles output of this peculiar array shape?
Generally, multi-class problems correspond to models outputting a probability distribution over the set of classes (typically scored against the one-hot encoding of the actual class through cross-entropy). Now, independently of whether you structure it as one single output, two outputs, 49 outputs or 49 x 2 = 98 outputs, that would mean having 1,200 x 49 + 1,300 x 49 = 122,500 output units, which a computer can certainly handle, but is maybe not the most convenient thing to have. You could try making each class output a single (e.g. linear) unit and rounding its value to choose the label, but, unless the labels have some numerical meaning (e.g. order, sizes, etc.), that is not likely to work.
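For concreteness, here is a minimal sketch (layer sizes are my assumption) of the two-softmax-output structure with the Keras functional API. Note it predicts a single Label1/Label2 pair per input; the full 49x2 problem would need 49 such pairs of heads, or one of the sequence approaches below.

from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(118,))               # the 118-long input vector
h = Dense(256, activation='relu')(inp)  # hypothetical shared hidden layer
out1 = Dense(1200, activation='softmax', name='label1')(h)  # 1,200 classes
out2 = Dense(1300, activation='softmax', name='label2')(h)  # 1,300 classes

model = Model(inputs=inp, outputs=[out1, out2])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')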
If the order of the elements in the input has some meaning (that is, shuffling it would affect the output), I think I'd approach the problem through an RNN, like an LSTM or a bidirectional LSTM model, with two outputs. Use return_sequences=True and TimeDistributed Dense softmax layers for the outputs, and for each 118-long input you'd have 118 pairs of outputs; then you can use temporal sample weighting to drop, for example, the first 69 (or maybe the first 35 and the last 34 if you're using a bidirectional model) and compute the loss with the remaining 49 pairs of labellings (see the sketch below). Or, if that makes sense for your data (maybe it doesn't), you could go with something more advanced like CTC, which is also implemented in Keras (thanks @indraforyou).
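A sketch of that sequence approach, under the assumption that the 118 input elements are integer codes that can go through an embedding (the vocabulary and layer sizes are placeholders):

from keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense
from keras.models import Model

inp = Input(shape=(118,))                          # 118 integer-coded elements
x = Embedding(input_dim=5000, output_dim=64)(inp)  # hypothetical vocabulary size
x = Bidirectional(LSTM(128, return_sequences=True))(x)  # one output per timestep
out1 = TimeDistributed(Dense(1200, activation='softmax'), name='label1')(x)
out2 = TimeDistributed(Dense(1300, activation='softmax'), name='label2')(x)

model = Model(inputs=inp, outputs=[out1, out2])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              sample_weight_mode='temporal')  # lets you zero-weight 69 of the 118 steps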
If the order in the input has no meaning but the order of the outputs does, then you could have an RNN where your input is the original 118-long vector plus a pair of labels (each one-hot encoded), and the output is again a pair of labels (again two softmax layers). The idea would be that you get one "row" of the 49x2 output on each frame, and then you feed it back to the network along with the initial input to get the next one; at training time, you would have the input repeated 49 times along with the "previous" label (an empty label for the first one).
If there are no sequential relationships to exploit (i.e. neither the order of the input nor the order of the output has a special meaning), then the problem would only be truly represented by the initial 122,500 output units (plus all the hidden units you may need to get those right). You could also try some kind of middle ground between a regular network and an RNN, where you have the two softmax outputs and, along with the 118-long vector, you include the "id" of the output that you want (e.g. as a 49-long one-hot encoded vector); if the "meaning" of each label at each of the 49 outputs is similar, or comparable, it may work. A sketch of this variant follows.
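Again with assumed layer sizes, that middle-ground variant could look like:

from keras.layers import Input, Dense, concatenate
from keras.models import Model

x_in = Input(shape=(118,), name='features')   # the original input vector
id_in = Input(shape=(49,), name='row_id')     # one-hot "which output row?" selector

h = Dense(256, activation='relu')(concatenate([x_in, id_in]))
out1 = Dense(1200, activation='softmax', name='label1')(h)
out2 = Dense(1300, activation='softmax', name='label2')(h)

model = Model(inputs=[x_in, id_in], outputs=[out1, out2])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')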
I have a training set of size 54 x 65536 and a testing set of size 18 x 65536.
I want to use a knn classifier, but I have some questions:
1) How should I define trainlabel?
Class = knnclassify(TestVec, TrainVec, TrainLabel, k);
Is it a vector of size 54 x 1 that defines which group each row in the training set belongs to, with the groups numbered 1, 2, ...?
2) To find the accuracy I used this:
cp = classperf(TrainLabel);
Class = knnclassify(TestVec, TrainVec, TrainLabel);
cp = classperf(TestLabel, Class);
cp.CorrectRate*100
Is this right? Is there another method to calculate it?
3) How can I enhance the accuracy?
4) How do I choose the best value of k?
I do not know MATLAB or the implementation of the KNN you are using, so I can answer only a few of your questions.
1) Your assumption is correct. trainlabel is a 54 x 1 vector, an array of size 54, or something equivalent that defines which group each data point (row) in the training set belongs to.
2) ... MATLAB / implementation related, sorry.
3) That is a very big discussion. Possible ways are:
Choose a better value of K.
Preprocess the data (or make preprocessing better if already applied).
Get a better / bigger trainset.
to name a few...
4) You can try different values of k, measure the accuracy for each one, and keep the best. (Note: if you do that, make sure you do not measure the accuracy per value of k only once; rather, use a technique like 10-fold cross-validation, as in the sketch below.)
There is more than a fair chance that the library you are using for the K-NN classifier provides such utilities.
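To make point 4 concrete, here is that k-selection loop sketched in Python with scikit-learn (not MATLAB, which I do not use, but the Statistics Toolbox has analogous cross-validation utilities); the data below is a random stand-in for your 54 x 65536 training set, and the label values are assumptions:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((54, 100))              # stand-in features
y = np.repeat([1, 2], 27)              # hypothetical 54 x 1 label vector

best_k, best_acc = None, -1.0
for k in range(1, 16, 2):              # odd k helps avoid ties with two classes
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)                # keep the k with the best CV accuracy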
I've been given some bacteria data and I'm supposed to use neural networks to classify the bacteria as belonging to Group A or Group B.
The bacteria dataset I've been given looks like this. There are 18 .mat Matlab datasets which are as follows: A1.mat, A2.mat, A3.mat, A4.mat, A5.mat, A6.mat, A7.mat, A8.mat, A9.mat, B1.mat, B2.mat, B3.mat, B4.mat, B5.mat, B6.mat, B7.mat, B8.mat, B9.mat.
Each of these MATLAB datasets consists of a 2510 x 2 matrix. The first column is the time information and the second column is some bacteria information. I extracted only the bacteria information in column 2 between indices 900 and 1200, which was the portion I needed for my analysis. This yielded a 209 x 1 matrix.
I then created my input data as a 209 x 18 matrix, i.e., I extracted the data between indices 900 and 1200 for each of the datasets and put everything together.
My goal in this project is to classify bacteria as belonging to Group A or Group B. From this point on, I'm at a loss on how to get the target values I need to feed into the neural network. Do I need additional information in order to proceed? That is, should the dataset have also contained target information as well? Any help at this point would be helpful. Thanks.
It sounds like you have 418 total exemplars, each with 9 features, with 209 belonging to Group A and 209 belonging to Group B. For what it's worth, you'd typically want many, many more exemplars to train a neural network.
Instead of thinking of your classification problem as A or B, think about it as 'A' or 'not A'. So exemplars belonging to Group A get a target value of 1, and exemplars belonging to Group B get a target value of 0.
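A minimal sketch of that reshaping and target construction in Python/NumPy (the MATLAB equivalent uses ones, zeros, and vertical concatenation the same way); I am assuming the nine A columns come before the nine B columns in your 209 x 18 matrix:

import numpy as np

data = np.random.rand(209, 18)                     # stand-in for your input matrix
A = data[:, :9]                                    # rows across columns A1..A9
B = data[:, 9:]                                    # rows across columns B1..B9

X = np.vstack([A, B])                              # 418 exemplars x 9 features
t = np.concatenate([np.ones(209), np.zeros(209)])  # 1 = Group A, 0 = not A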
I'm hoping there is a MATLAB function similar to this Arduino function:
http://arduino.cc/en/Reference/map
Basically I have a time-based variable with 67 data points ranging from 0 to 1.15, and I want to map that onto 0 to 100% (so, 101 data points). In Arduino that would look something like:
map(value, fromLow, fromHigh, toLow, toHigh)
I can use interp1 in MATLAB to get the 101 data points, but that just gives me 101 data points between 0 and 1.15. I know I can just multiply each value by 100/1.15, but this seems inexact. Is there a more elegant way to do this in MATLAB that I'm overlooking?
(This post looked hopeful, but it's not what I'm looking for:
Map function in MATLAB?)
Thanks
If you have the Neural Network Toolbox available, you can try the mapminmax function. By default, the function maps to the [-1, 1] interval and takes the input bounds from the data, but I believe that filling the settings structure with your values and then calling mapminmax should help.
You can use linspace; for example,
linspace(0,1.15,101)
will get you 101 points spread uniformly between the limits 0 and 1.15.
My FEX submission maptorange can do exactly that. It takes initial value(s), the range from which they originate, and the range onto which they should be mapped, and returns the mapped value(s). In your example, that would be:
maptorange(values, [0 1.15], [0 100]);
(This is assuming linear mapping. The script can also map along an exponential function.)
To go from 67 to 101 values, you would indeed need interpolation. This can be done either before or after mapping.
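For reference, the Arduino-style map() is just a one-line linear rescale. Here is a Python sketch of the mapping plus the 67-to-101 resampling (the arithmetic, and interp1, carry over to MATLAB directly); the sample data is a stand-in for your variable:

import numpy as np

def map_range(value, from_low, from_high, to_low, to_high):
    # linearly map value from [from_low, from_high] to [to_low, to_high]
    return (value - from_low) * (to_high - to_low) / (from_high - from_low) + to_low

values = np.linspace(0, 1.15, 67)            # stand-in for your 67 data points
mapped = map_range(values, 0, 1.15, 0, 100)  # now spans 0..100

x_old = np.linspace(0, 100, 67)              # old sample positions
x_new = np.linspace(0, 100, 101)             # 101 evenly spaced query points
resampled = np.interp(x_new, x_old, mapped)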
I'm struggling to figure out how exactly to begin using SVD with a MovieLens/Netflix type data set for rating predictions. I'd very much appreciate any simple samples in python/java, or basic pseudocode of the process involved. There are a number of papers/posts that summarise the overall concept but I'm not sure how to begin implementing it, even using a number of the suggested libraries.
As far as I understand, I need to convert my initial data set as follows:
Initial data set:

user  movie  rating
1     43     3
1     57     2
2     219    4

Need to pivot to:

           user 1   user 2
movie 43        3        0
movie 57        2        0
movie 219       0        4
At this point, do I simply need to feed this matrix into an SVD algorithm as provided by the available libraries, and then (somehow) extract results, or is there more work required on my part?
Some information I've read:
http://www.netflixprize.com/community/viewtopic.php?id=1043
http://sifter.org/~simon/journal/20061211.html
http://www.slideshare.net/NYCPredictiveAnalytics/building-a-recommendation-engine-an-example-of-a-product-recommendation-engine
http://www.slideshare.net/bmabey/svd-and-the-netflix-dataset-presentation
.. and a number of other papers
Some libraries:
LingPipe(java)
Jama(java)
Pyrsvd(python)
Any tips at all would be appreciated, especially on a basic data set.
Thanks very much,
Oli
See SVDRecommender in Apache Mahout. Your question about input format entirely depends on what library or code you're using. There's not one standard. At some level, yes, the code will construct some kind of matrix internally. For Mahout, the input for all recommenders, when supplied as a file, is a CSV file with rows like userID,itemID,rating.
Data set: http://www.grouplens.org/node/73
SVD: if you don't understand how to compute an SVD, why not just do it in SAGE? Wolfram Alpha or http://www.bluebit.gr/matrix-calculator/ will decompose the matrix for you, or the algorithm is described on Wikipedia.
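Since you asked for a basic Python sample, here is a toy end-to-end sketch of the pivot-plus-truncated-SVD idea with NumPy. It is illustrative only: filling missing ratings with zeros is a crude baseline, and the Netflix-prize style approaches you linked use bias terms and iterative/SGD factorization to handle missing entries properly.

import numpy as np

# pivoted matrix from the question: rows = movies 43, 57, 219; columns = users 1, 2
R = np.array([[3.0, 0.0],
              [2.0, 0.0],
              [0.0, 4.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 1                                  # number of latent factors to keep
R_hat = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction
print(R_hat)                           # predicted scores for every user/movie cell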