Confused by Jaccard Similarity concept - recommendation-engine

I am going through item-item similarity and my professor said that if I have a popularity based collaborative filtering, then we need to normalize using the Jaccard similarity.
I have the following data
Jack watched movie 2, movie 1, movie 3
Bob watched movie 1 and movie 3
Tim watched movie 1
For Tim, we need to recommend movies using item-based collaborative filtering.
The co-occurrence matrix, counting the users who have seen both movies in a pair, is:
          movie 1   movie 2   movie 3
movie 1   0         1         2
movie 2   1         0         1
movie 3   2         1         0
My professor says that after normalizing the matrix above using Jaccard similarity we get the following matrix:
          movie 1   movie 2   movie 3
movie 1   0         1/3       2/2
movie 2   1/3       0         1/2
movie 3   2/3       1/2       0
Can someone please explain why people who saw movie 1 and then movie 3 are not similar to someone who saw movie 3 and then movie 1, i.e. why the normalized matrix is not symmetric (2/2 versus 2/3)?
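For reference, here is a minimal Python sketch of the computation, assuming the standard set-based definition J(A, B) = |A ∩ B| / |A ∪ B| applied to the sets of users who watched each movie. This may not be the exact normalization the professor used. The names `watched`, `viewers`, `cooccurrence` and `jaccard` are made up for this illustration.

```python
from fractions import Fraction

# Who watched what, taken from the question.
watched = {
    "Jack": {"movie 1", "movie 2", "movie 3"},
    "Bob": {"movie 1", "movie 3"},
    "Tim": {"movie 1"},
}

# Invert the data: for every movie, the set of users who watched it.
viewers = {}
for user, movies in watched.items():
    for movie in movies:
        viewers.setdefault(movie, set()).add(user)


def cooccurrence(a, b):
    """Number of users who watched both movies."""
    return len(viewers[a] & viewers[b])


def jaccard(a, b):
    """Standard Jaccard similarity: |intersection| / |union| of the viewer sets."""
    return Fraction(len(viewers[a] & viewers[b]), len(viewers[a] | viewers[b]))


for a in sorted(viewers):
    for b in sorted(viewers):
        if a < b:
            print(a, b, cooccurrence(a, b), jaccard(a, b), jaccard(b, a))
```

Under this definition the value is the same in both directions, because intersection and union do not depend on the order of the two sets. For movie 1 and movie 3 it is 2/3 both ways, so a 2/2 entry does not come from the standard Jaccard formula.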

Related

Clustering data with different approaches

I have the following type of data:
The *.edge file has the connections between the ids of different users:
1 23
4 67
...
The *.feat file contains properties of the ids. The first column (column 0) holds the user ids. The other columns represent features that are named in another file. For example, userid 1 does not have the feature of column 1 (0), but userid 4 does (1):
1: 0 0 1 0 1 1 0 1 1
4: 1 0 1 1 1 0 1 1 1
...
Now I want to cluster the data using different algorithms like k-means, DBSCAN, hierarchical clustering and so on. But as I read, there are several problems with multidimensional data?
There are problems with very high-dimensional data, but 10 is not high. You have other problems: k-means needs coordinates to compute means, not a graph with edges. Also, the values should be continuous, not binary. You need to study these methods in more detail. If you say "But as I read ...", then try to give a reference.

Extending Rabin-Karp algorithm to hash a 2D matrix

I'm trying to solve a problem that asks for the size of the biggest common subsquare between two matrices.
e.g.
Matrix #1
3 3
1 2 0
1 2 1
1 2 3
Matrix #2
3 3
0 1 2
1 1 2
3 1 2
Answer: 2
Biggest common subsquare is:
1 2
1 2
I know that the Rabin-Karp algorithm can be extended to work on a 2D matrix, but I can't understand exactly how to do that. I tried to understand the author's code in the editorial, but it's too complicated. I also searched for a good explanation, but I couldn't find a clear one.
Can anyone explain simply how I can use the Rabin-Karp algorithm to hash a matrix? I know I will hash rows and columns, but I can't see how to combine their hashes into a hash of the whole matrix, or how the rolling hash function is handled in this case.
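One common way to combine the two directions is to hash in two passes. First roll a hash along every row for each window of width k. Then treat each column of those row hashes as a sequence and roll a second hash down it over k rows. Below is a minimal Python sketch of that idea. It is not the editorial's code, and the bases, the modulus and all function names are arbitrary choices made for this illustration. Matching hashes are verified cell by cell, so a hash collision cannot change the answer.

```python
MOD = (1 << 61) - 1   # large prime modulus (arbitrary choice)
BASE_COL = 131        # base used while rolling along a row
BASE_ROW = 137        # base used while rolling down a column


def window_hashes(seq, k, base):
    """Rabin-Karp rolling hashes of every length-k window of seq."""
    top = pow(base, k - 1, MOD)           # weight of the element leaving the window
    h = 0
    for x in seq[:k]:
        h = (h * base + x) % MOD
    out = [h]
    for i in range(k, len(seq)):
        h = ((h - seq[i - k] * top) * base + seq[i]) % MOD
        out.append(h)
    return out


def square_hashes(mat, k):
    """Map hash -> list of top-left corners (r, c) of every k x k subsquare."""
    # Pass 1: rows[r][c] is the hash of mat[r][c:c+k].
    rows = [window_hashes(row, k, BASE_COL) for row in mat]
    result = {}
    # Pass 2: roll down each column of row hashes over k consecutive rows.
    for c in range(len(rows[0])):
        column = [rows[r][c] for r in range(len(mat))]
        for r, h in enumerate(window_hashes(column, k, BASE_ROW)):
            result.setdefault(h, []).append((r, c))
    return result


def same_square(a, ra, ca, b, rb, cb, k):
    """Exact comparison, used to rule out hash collisions."""
    return all(a[ra + i][ca:ca + k] == b[rb + i][cb:cb + k] for i in range(k))


def has_common_square(a, b, k):
    table = square_hashes(a, k)
    for h, corners_b in square_hashes(b, k).items():
        for ra, ca in table.get(h, []):
            for rb, cb in corners_b:
                if same_square(a, ra, ca, b, rb, cb, k):
                    return True
    return False


def largest_common_square(a, b):
    """Binary search on k: a common k x k square implies a common (k-1) x (k-1) one."""
    lo, hi = 0, min(len(a), len(a[0]), len(b), len(b[0]))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_common_square(a, b, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


m1 = [[1, 2, 0], [1, 2, 1], [1, 2, 3]]
m2 = [[0, 1, 2], [1, 1, 2], [3, 1, 2]]
print(largest_common_square(m1, m2))  # 2 for the example in the question
```

The binary search works because the property is monotone in k. Each check costs time roughly proportional to the matrix size, plus the verification of any matching hashes.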

Matlab Facebook Information Gender Recognition

Our teacher asked us to use any classifier to guess whether a Facebook user is male or female based on the information they have (Music, Books, Movies, Sports, People).
I divided Music, Books and Movies into genres, Sports into YES/NO, and People (whether the user liked a page of a male or a female) into Woman/Man.
For example Music(1,1)=Hip Hop, Music(2,1)=Pop.
In the second column, I put my guess of whether the user is male or female.
For example, I guessed that if Movie=Romantic then gender=woman, etc.
Then I made a matrix named MuMoBSP (Music, Movies, Books, Sports, People), entered my guesses, and put 1 for male and 2 for female.
I found a C++-like way to make it work, but I need to use classifiers.
Can you help me?
My code is:
MuMoBSP=[1 1;2 1;3 2;4 2;5 2;6 2;7 1;8 1;9 1;10 1;11 1;12 2;13 2;14 2;15 2;16 2;17 1;18 1;19 1;20 1;21 1;22 1;23 1;24 2;25 2;26 2;27 2;28 1;29 2;30 1;31 2]
filename='Facebook.csv'
Data=dlmread(filename)
% Only the first user (row 1 of Data) is classified here.
% Music-based gender
for k=1:6
    if (Data(1,1)==MuMoBSP(k,1))
        Gender(1,1)=MuMoBSP(k,2);
    end
end
% Movies-based gender
for k=7:16
    if (Data(1,2)==MuMoBSP(k,1))
        Gender(1,2)=MuMoBSP(k,2);
    end
end
% Books-based gender
for k=17:27
    if (Data(1,3)==MuMoBSP(k,1))
        Gender(1,3)=MuMoBSP(k,2);
    end
end
% Sports-based gender
for k=28:29
    if (Data(1,4)==MuMoBSP(k,1))
        Gender(1,4)=MuMoBSP(k,2);
    end
end
% People-based gender
for k=30:31
    if (Data(1,5)==MuMoBSP(k,1))
        Gender(1,5)=MuMoBSP(k,2);
    end
end
% Print man/woman (disp is used because sprintf followed by ';' shows nothing)
if (sum(Gender)== 9)
    disp('woman');
end
if (sum(Gender)== 8)
    disp('woman');
end
if (sum(Gender)== 7)
    disp('man');
end
if (sum(Gender)== 6)
    disp('man');
end
if (sum(Gender)== 5)
    disp('man');
end
if (sum(Gender)== 10)
    disp('woman');
end
The Facebook.csv file is given below. Its 1st column is Music, the 2nd is Movies, the 3rd is Books, the 4th is Sports and the 5th is People.
2;7;17;28;30
1;8;17;28;30
2;10;23;28;30
2;11;22;28;30
1;7;21;28;30
2;9;18;28;30
1;7;19;28;30
3;12;24;29;31
4;14;27;29;31
4;16;27;29;31
6;13;25;29;31
6;14;26;29;31
5;16;27;29;31
5;12;26;29;31
UPDATE
I changed MuMoBSP and the data sheet (see at the top) as hbaderts suggested.
MuMoBSP =
1 1
2 1
3 2
4 2
5 2
6 2
7 1
8 1
9 1
10 1
11 1
12 2
13 2
14 2
15 2
16 2
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 2
25 2
26 2
27 2
28 1
29 2
30 1
31 2
I tried to use the k-means function but I think I made some mistakes.
[idx,C] = kmeans(Data,2);
figure;
plot(Data(idx==1,1),Data(idx==1,2),'r.','MarkerSize',20)
hold on
plot(Data(idx==2,1),Data(idx==2,2),'b.','MarkerSize',20)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
silhouette(Data,idx)
(Figure: left plot, before silhouette)
Why are they so far from the centroid? How can I fix that?
Theory
You are probably looking for k-means clustering. The idea is quite simple: we estimate a "prototype" male and female. If a data point (person) is closer to the average, prototype male, then it will also be a male. If the data point is closer to the average female, it will be a female. We do this using the following algorithm:
1. Choose k (in your case: 2) random initial centroid points.
   Our two centroid points are our "prototypes" of a male and a female: an average female is specified by the centroid of the "female" cluster, and an average male is the centroid of the "male" cluster.
2. For each data point, we calculate the nearest centroid. If a data point is nearer to centroid 1, we assign e.g. the label "Male". If it is nearer to centroid 2, we assign the label "Female".
   So far, this assignment is completely random - now we have to iteratively fit our clusters to the data we have.
3. For both clusters, we calculate the new mean value over all data points, i.e. the mean music genre, the mean movie genre and so on for our "Male" and "Female" clusters.
4. This new mean value is an approximation of the real underlying cluster means. So we repeat step 2, to assign the data points to the corrected clusters. Some data points which were previously "female" will now be assigned to "male", and vice-versa. Of course, some will stay the same.
5. As our clusters have changed, the mean values we calculated in step 3 have changed too, so we repeat step 3, and find our new cluster centroids. So we will also have to repeat step 2 again, and step 3 again, and so on. We repeat step 2 and 3, until our assignments don't change anymore, which means we have found a solution.
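The loop described above can be sketched in a few lines of plain Python, used here only for illustration. This is not MATLAB's kmeans, and the function name and arguments are hypothetical. To keep the result reproducible, the caller passes in the initial centroids instead of drawing them at random.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment (step 2) and mean update (step 3)."""
    centroids = [list(c) for c in centroids]      # step 1: initial centroids from the caller
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:                  # assignments stopped changing: done
            break
        labels = new_labels
        # Step 3: move every centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                           # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups (made-up data, not the Facebook data set).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cents = kmeans(pts, [pts[0], pts[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

With a random start, the result can depend on the initial centroids. Library implementations therefore usually offer several restarts.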
Implementation in MATLAB
In Matlab, there is a kmeans function, which makes this as simple as calling
idx = kmeans(Data, 2);
Of course, Matlab doesn't know about "male" or "female", so there is only cluster 1 and cluster 2, and it will be your job to judge which one is male, and which is female. I assume, the one who likes Sci-Fi movies and books, watches sports and follows women's profiles will be the man ;-)
But wait...
Let's look at the music preference: in your code, you say the following:
1: Hip Hop Male
2: Pop Female
3: Jazz Female
4: Metal Male
5: Blues Female
6: Rock Female
If those 6 preferences are equally likely, an average man will have a "music value" of (1+4)/2 = 2.5 and an average woman will have a "music value" of (2+3+5+6)/4 = 4. So somebody who likes Pop music (value 2) is closer to the male average and will more likely be classified as male, even though we don't want that!
Why does that happen? For k-means clustering, we want inputs where a low value corresponds to cluster 1 and a high value corresponds to cluster 2 (or the other way around, that doesn't matter). The important thing is that we need inputs which allow us to calculate meaningful "average persons".
If you can say that Hip Hop is "a bit manly" and Metal is "very manly" music, while Rock is "a bit feminine", Blues is "more feminine", Jazz is "even more feminine" and Pop is "very feminine", you could change the labels to
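The arithmetic can be checked with a short Python sketch, assuming all six genres are equally likely. The dictionary `music` simply restates the coding listed above, and the variable names are made up for this illustration.

```python
# Genre code -> (genre, guessed gender), as listed above.
music = {
    1: ("Hip Hop", "Male"),
    2: ("Pop", "Female"),
    3: ("Jazz", "Female"),
    4: ("Metal", "Male"),
    5: ("Blues", "Female"),
    6: ("Rock", "Female"),
}

male_codes = [code for code, (_, gender) in music.items() if gender == "Male"]
female_codes = [code for code, (_, gender) in music.items() if gender == "Female"]

male_mean = sum(male_codes) / len(male_codes)        # (1 + 4) / 2
female_mean = sum(female_codes) / len(female_codes)  # (2 + 3 + 5 + 6) / 4

pop = 2
nearest = "Male" if abs(pop - male_mean) < abs(pop - female_mean) else "Female"
print(male_mean, female_mean, nearest)
```

Pop is coded as 2, which is 0.5 away from the male mean but 2 away from the female mean, so a distance-based method puts it on the wrong side.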
1: Metal
2: Hip Hop
3: Rock
4: Blues
5: Jazz
6: Pop
Then, somebody with a very high value listens to "feminine" music, while somebody with a low value listens to "manly" music. An average man will have a value of (1+2)/2 = 1.5, and an average woman will have a value of (3+4+5+6)/4 = 4.5.
If that is not the case (as is probably true in your case), you can for example create a bunch of new input variables:
x_1: Person likes Hip-Hop
x_2: Person likes Pop
...
where each variable is either 0 (false) or 1 (true). Instead of having 5 input variables, you will have e.g. 31 input variables, which are either 0 or 1. This comes with the advantage that you can use continuous values too: somebody who likes two Pop bands and three Metal bands can get a 0.4 for Pop and a 0.6 for Metal.
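A minimal Python sketch of this encoding for the Music column follows. The genre order and the helper names `one_hot` and `genre_shares` are assumptions made for this illustration. The other columns would be encoded the same way and concatenated into one long feature vector.

```python
GENRES = ["Hip Hop", "Pop", "Jazz", "Metal", "Blues", "Rock"]


def one_hot(genre):
    """One 0/1 variable per genre: 1 for the genre the person likes, 0 elsewhere."""
    return [1.0 if g == genre else 0.0 for g in GENRES]


def genre_shares(liked_bands):
    """Continuous variant: share of liked bands per genre (input is one genre name per band)."""
    return [liked_bands.count(g) / len(liked_bands) for g in GENRES]


print(one_hot("Pop"))
print(genre_shares(["Pop", "Pop", "Metal", "Metal", "Metal"]))
```

With this encoding, the distance between two people no longer depends on an arbitrary numbering of the genres.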

How to draw a Histogram in Matlab

I have a set of around 35000 data points. They are signal strengths received at a single location over different time intervals. I want to plot a histogram of these data. The X-axis should show the signal strength and the Y-axis should show the probability. The histogram will consist of bars, each giving the probability of one signal strength.
For example, suppose I have the following data
a= [ 1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 5 6 6 6 6 6 6 6 6 6 6 6]
How can I plot the graph with the data on the X-axis and the probability on the Y-axis? Any help will be appreciated. Thanks!
This should work just fine if you don't want to use some predefined functions:
una=unique(a);
normhist=hist(a,una)/numel(a);
figure, stairs(una,normhist)
una holds only the unique values of a. normhist lies between 0 and 1 and gives the probability of each signal value occurring, because the counts are divided by the number of elements in the data.
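The same normalization can be sketched in Python with the standard library: count each distinct value and divide by the total number of samples. The variable names are made up for this illustration. Plotting is left out, since any bar plot of `values` against `probs` would do.

```python
from collections import Counter

a = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
     4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]

counts = Counter(a)                               # value -> number of occurrences
values = sorted(counts)                           # distinct signal strengths (X-axis)
probs = [counts[v] / len(a) for v in values]      # relative frequencies (Y-axis)

for v, p in zip(values, probs):
    print(v, round(p, 3))
```

The probabilities sum to 1 because every sample is counted exactly once.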

comparing images matlab

OK, so let's say I have a binary image containing the pixel representation of 1, 2, A, B or whatever. For now, let's just consider 1:
0 0 0 0
0 1 1 0
0 1 1 0
0 1 1 0
0 1 1 0
0 0 0 0
Then I have another image containing the standard representation of 1.
What I want is to compare these two images and decide whether my first image contains the pixel values for 1 or not.
What kinds of algorithms are at my disposal?
Please note that I do not need the name of a MATLAB function for image comparison, as has been the answer to similar questions. Instead, I need the names of some algorithms that can solve this problem, so that I can implement one on my own in C#.
What you need to compute is the distance between your image and the ground truth. This distance can be defined in many different ways. Search Google for similarity measures on binary data. See here a review.
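Two of the simplest such measures are the Hamming distance (the number of pixels that differ) and the Jaccard similarity (shared 1-pixels divided by pixels that are 1 in either image). The Python sketch below is only an illustration and assumes both images have the same size and are already aligned. The function names and the thinner test template are made up. The same loops translate directly to C#.

```python
def hamming_distance(img, template):
    """Number of pixel positions where the two binary images differ."""
    return sum(p != q
               for row_a, row_b in zip(img, template)
               for p, q in zip(row_a, row_b))


def jaccard_similarity(img, template):
    """Shared foreground pixels divided by pixels that are foreground in either image."""
    both = either = 0
    for row_a, row_b in zip(img, template):
        for p, q in zip(row_a, row_b):
            both += 1 if (p and q) else 0
            either += 1 if (p or q) else 0
    return both / either if either else 1.0   # two empty images count as identical


# The image from the question.
one = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A hypothetical thinner template of the same digit.
thin_one = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

print(hamming_distance(one, thin_one))     # 4
print(jaccard_similarity(one, thin_one))   # 0.5
```

Comparing the score against a threshold, or picking the template with the best score, then gives the decision.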