Basic Pseudocode for using SVD with Movielens/Netflix type data set - recommendation-engine

I'm struggling to figure out how exactly to begin using SVD with a MovieLens/Netflix type data set for rating predictions. I'd very much appreciate any simple samples in python/java, or basic pseudocode of the process involved. There are a number of papers/posts that summarise the overall concept but I'm not sure how to begin implementing it, even using a number of the suggested libraries.
As far as I understand, I need to convert my initial data set as follows:
Initial data set:
user movie rating
1 43 3
1 57 2
2 219 4
Need to pivot to be:
           user 1   user 2
movie 43        3        0
movie 57        2        0
movie 219       0        4
At this point, do I simply feed this matrix into an SVD implementation provided by one of the available libraries and then (somehow) extract results, or is there more work required on my part?
Some information I've read:
http://www.netflixprize.com/community/viewtopic.php?id=1043
http://sifter.org/~simon/journal/20061211.html
http://www.slideshare.net/NYCPredictiveAnalytics/building-a-recommendation-engine-an-example-of-a-product-recommendation-engine
http://www.slideshare.net/bmabey/svd-and-the-netflix-dataset-presentation
.. and a number of other papers
Some libraries:
LingPipe(java)
Jama(java)
Pyrsvd(python)
Any tips at all would be appreciated, especially on a basic data set.
Thanks very much,
Oli

See SVDRecommender in Apache Mahout. Your question about input format entirely depends on what library or code you're using. There's not one standard. At some level, yes, the code will construct some kind of matrix internally. For Mahout, the input for all recommenders, when supplied as a file, is a CSV file with rows like userID,itemID,rating.
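For a concrete starting point, here is a minimal sketch of the pivot-then-decompose step in Python with NumPy, using the toy ratings from the question. On this tiny full-rank example the reconstruction is exact; on real data you would pick k much smaller than the rank of the matrix, so the low-rank reconstruction fills in predictions for the unrated cells, and you would use an incremental method (as in the Simon Funk post linked above) rather than a dense SVD:

```python
import numpy as np

# toy ratings from the question: (user, movie, rating) triples
ratings = [(1, 43, 3), (1, 57, 2), (2, 219, 4)]

# map the sparse user/movie IDs to dense matrix indices
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
u_idx = {u: i for i, u in enumerate(users)}
m_idx = {m: i for i, m in enumerate(movies)}

# pivot into a (movies x users) matrix, with 0 for "unrated"
R = np.zeros((len(movies), len(users)))
for u, m, r in ratings:
    R[m_idx[m], u_idx[u]] = r

# full SVD, then keep only the top-k singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat[i, j] is the predicted rating of movie i by user j
```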

Data set: http://www.grouplens.org/node/73
SVD: if you don't understand how to compute an SVD, why not just do it in Sage? Wolfram Alpha or http://www.bluebit.gr/matrix-calculator/ will decompose the matrix for you, and the algorithm is also described on Wikipedia.

Related

Enhancing accuracy of knn classifier

I have training set of size 54 * 65536 and a testing set of 18 * 65536.
I want to use a knn classifier, but I have some questions:
1) How should I define trainlabel?
Class = knnclassify(TestVec,TrainVec, TrainLabel,k);
Is it a vector of size 54 x 1 that defines which group each row in the training set belongs to? Here the groups are numbered 1, 2, ..
2) To find the accuracy I used this:
cp = classperf(TrainLabel);
Class = knnclassify(TestVec,TrainVec, TrainLabel);
cp = classperf(TestLabel,Class);
cp.CorrectRate*100
Is this right? Is there another method to calculate it?
3) How can I enhance the accuracy?
4) How do I choose the best value of k?
I do not know matlab nor the implementation of the knn you are providing, so I can answer only a few of your questions.
1) Your assumption is correct. trainlabel is a 54 x 1 vector (or an array of size 54, or something equivalent) that defines which group each data point (row) in the training set belongs to.
2) ... MATLAB / implementation related, sorry.
3) That is a very big discussion. Possible ways are:
Choose a better value of K.
Preprocess the data (or make preprocessing better if already applied).
Get a better / bigger trainset.
to name a few...
4) You can try different values of k, measure the accuracy for each one, and keep the best. (Note: if you do that, make sure you do not measure the accuracy for each value of k only once; instead, use a technique like 10-fold cross-validation.)
There is more than a fair chance that the library you are using for the k-NN classifier provides such utilities.
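The question is about MATLAB, but as an illustration of point 4, here is the same idea in Python with scikit-learn (the iris data set is just a stand-in for the 54 x 65536 training set from the question):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the real training set

# measure each candidate k with 10-fold cross-validation, keep the best
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=10).mean()

best_k = max(scores, key=scores.get)
```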

How to handle huge sparse matrices construction using Scipy?

So, I am working on a Wikipedia dump to compute the PageRanks of around 5,700,000 pages, give or take.
The files are preprocessed and hence are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm
and the format is:
from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)
and so on. So basically it's the construction of a [5,700,000 x 5,700,000] matrix, which would just blow past my 4 GB of RAM. Since it is very, very sparse, that makes it easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix. Now my issue is:
How on earth do I go about converting the .txt file with the link information to a sparse matrix? Read it and compute it as a normal N*N matrix then convert it or what? I have no idea.
Also, the links sometimes span across lines so what would be the correct way to handle that?
eg: a random line is like..
[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]
exactly like this: no commas & no delimiters.
Any information on sparse matrix construction and data handling across lines would be helpful.
Scipy offers several implementations of sparse matrices. Each of them has its own advantages and disadvantages. You can find information about the matrix formats in the scipy.sparse documentation.
There are several ways to get to your desired sparse matrix. Computing the full NxN matrix and then converting is probably not possible, due to the high memory requirements (about 3.2 x 10^13 entries!).
In your case I would prepare your data to construct a coo_matrix.
coo_matrix((data, (i, j)), [shape=(M, N)])
data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries
You might also want to have a look at lil_matrix, which can be used to incrementally build your matrix.
Once you have created the matrix, you can convert it to a format better suited for calculation, depending on your use case.
I do not recognize the data format, there might be parsers for it, there might not. Writing your own parser should not be very difficult, though. Each line containing a colon starts a new row, all indices after the colon and in consecutive lines without colons are the column entries for said row.
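A rough sketch of such a parser in Python (the input string is a made-up miniature of the format described in the question; lines containing a colon start a new row, colon-free lines continue the previous one):

```python
import numpy as np
from scipy.sparse import coo_matrix

# made-up miniature of the question's format:
# "row: col col ...", with continuation lines that have no colon
text = """1: 2 3 5
2: 1
3 4
3: 2"""

rows, cols = [], []
current = None
for line in text.splitlines():
    if ':' in line:
        head, _, rest = line.partition(':')
        current = int(head)          # a colon starts a new row
        tokens = rest.split()
    else:
        tokens = line.split()        # no colon: continue the previous row
    for tok in tokens:
        rows.append(current - 1)     # switch to 0-based indices
        cols.append(int(tok) - 1)

# in the real data n would be 5,700,000; here we infer it from the toy input
n = max(max(rows), max(cols)) + 1
adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
```

Streaming the real file line by line into the three lists keeps memory proportional to the number of links rather than N*N.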

How to map ranges of values in MATLAB

I'm hoping there is a MATLAB function similar to this Arduino function:
http://arduino.cc/en/Reference/map
Basically I have a time based variable with 67 data points ranging from 0 to 1.15, and I want to map that from 0 to 100% (so, 101 data points). In Arduino that would look something like:
map(value, fromLow, fromHigh, toLow, toHigh)
I can use interp1 in MATLAB to get me the 101 data points, but I just get 101 data points between 0 and 1.15. I know I can just multiply each value by 100/1.15, but this is inexact. Is there a more elegant way to do this in MATLAB that I'm overlooking?
(This post looked hopeful, but it's not what I'm looking for:
Map function in MATLAB?)
Thanks
If you have the Neural Networks Toolbox available, then you can try the mapminmax function. By default, the function maps to the [-1, 1] interval and takes the input bounds from the data. But I believe that filling the settings structure with your values and then calling mapminmax should help.
You can use linspace; for example,
linspace(0,1.15,101)
will get you 101 points spread uniformly between the limits 0 and 1.15.
My FEX submission maptorange can do exactly that. It takes initial value(s), the range from which they originate, and the range onto which they should be mapped, and returns the mapped value(s). In your example, that would be:
maptorange(values, [0 1.15], [0 100]);
(This is assuming linear mapping. The script can also map along an exponential function.)
To go from 67 to 101 values, you would indeed need interpolation. This can be done either before or after mapping.
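For reference, Arduino's map is just a linear rescaling, and combined with interpolation the whole operation is only a few lines. A sketch in Python/NumPy (the sine signal is a placeholder for the real data; the formula carries over verbatim to MATLAB):

```python
import numpy as np

def map_to_range(x, from_lo, from_hi, to_lo, to_hi):
    """Linearly map x from [from_lo, from_hi] onto [to_lo, to_hi]."""
    return to_lo + (x - from_lo) * (to_hi - to_lo) / (from_hi - from_lo)

# 67 samples on [0, 1.15], as in the question
t = np.linspace(0, 1.15, 67)
values = np.sin(t)  # placeholder signal

# map the time axis to 0..100 %, then interpolate onto 101 points
pct = map_to_range(t, 0.0, 1.15, 0.0, 100.0)
grid = np.arange(101)
resampled = np.interp(grid, pct, values)
```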

Choosing desired rows in matlab

I have a data set of 9 columns and 14470 rows.
The first column is filled with 0 or 1. Zero means that there is no measurement and the whole row is not of interest. Can somebody help me write a loop that goes through all lines and keeps only the rows where the first column is 1?
You do not need a loop for this; remember, MATLAB is a matrix-oriented programming language and loops should be avoided. I won't give you the answer; I think you can figure it out yourself, it's easy. This tutorial will help.
Have fun.
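For later readers, the loop-free idea looks like this in Python/NumPy (MATLAB's logical indexing works the same way; the toy array is a stand-in for the 14470 x 9 data):

```python
import numpy as np

# toy stand-in for the data: the first column is the 0/1 measurement flag
data = np.array([
    [1, 10.0, 20.0],
    [0,  0.0,  0.0],
    [1, 30.0, 40.0],
])

# boolean indexing keeps only the rows whose first column equals 1 - no loop
measured = data[data[:, 0] == 1]
```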

How to use KNN in Matlab

I need to use KNN in MATLAB to find the data point in the training data that is closest to A.
I have data in .mat that has this kind of information (training data):
train_data = 1 232 34 21 0.542
2 32 333 542 0.32
and so on.
Then I have a second piece of information that I will gather through the application, but I will only get
A = 2 343 543 43 0.23
So now my question is: is something like this all I need to do, and can I use it this way?
Does KNN need a training step, or do you only load the training data and the new data (like A) and run it through some formula, or first load it into one function that learns the model and then a second function that gives you the result?
Best regards.
So you have a training set (with labels) and some test data without labels? I think you can use the function you linked to, ClassificationKNN. If I understand your question, you want something like this example: Predict Classification Based on a KNN Classifier
http://www.mathworks.se/help/stats/classification-using-nearest-neighbors.html#btap7nm
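For k = 1 there is no real training step beyond storing the data. As an illustration, the same nearest-neighbour lookup in Python with scikit-learn (treating the first column as an ID and the remaining columns as features is an assumption about the data layout):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy training data from the question: an id, then four feature columns
train_data = np.array([
    [1, 232,  34,  21, 0.542],
    [2,  32, 333, 542, 0.32],
])
A = np.array([[2, 343, 543, 43, 0.23]])

# fit on the feature columns only (dropping the id column is an assumption)
nn = NearestNeighbors(n_neighbors=1).fit(train_data[:, 1:])
dist, idx = nn.kneighbors(A[:, 1:])

# the full training row closest to A
closest = train_data[idx[0, 0]]
```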