Matlab Facebook Information Gender Recognition

Our teacher asked us to use any classifier to guess whether a Facebook user is male or female based on the information they have (Music, Books, Movies, Sports, People).
I divided Music, Books, and Movies into genres, Sports into YES/NO, and People (whether the user liked a page of a man or a woman) into Man/Woman.
For example, Music(1,1)=Hip Hop, Music(2,1)=Pop.
In the second column, I put my guess as to whether the user is male or female.
For example, I guessed that if Movie=Romantic then gender=woman, etc.
Then I made a matrix named MuMoBSP (Music, Movies, Books, Sports, People) where I entered my guesses, using 1 for male and 2 for female.
I found a C++-like way to make it work, but I need to use classifiers.
Can you help me?
My code is:
MuMoBSP = [1 1; 2 1; 3 2; 4 2; 5 2; 6 2; 7 1; 8 1; 9 1; 10 1; 11 1; ...
           12 2; 13 2; 14 2; 15 2; 16 2; 17 1; 18 1; 19 1; 20 1; 21 1; ...
           22 1; 23 1; 24 2; 25 2; 26 2; 27 2; 28 1; 29 2; 30 1; 31 2];
filename = 'Facebook.csv';
Data = dlmread(filename, ';');   % the file is semicolon-delimited
Gender = zeros(1, 5);            % per-feature gender guesses for the first user
%Music Based Gender%
for k = 1:6
    if Data(1,1) == MuMoBSP(k,1)
        Gender(1) = MuMoBSP(k,2);
    end
end
%Movies Based Gender%
for k = 7:16
    if Data(1,2) == MuMoBSP(k,1)
        Gender(2) = MuMoBSP(k,2);
    end
end
%Books Based Gender%
for k = 17:27
    if Data(1,3) == MuMoBSP(k,1)
        Gender(3) = MuMoBSP(k,2);
    end
end
%Sports Based Gender%
for k = 28:29
    if Data(1,4) == MuMoBSP(k,1)
        Gender(4) = MuMoBSP(k,2);
    end
end
%People Based Gender%
for k = 30:31
    if Data(1,5) == MuMoBSP(k,1)
        Gender(5) = MuMoBSP(k,2);
    end
end
%Print Man/Woman: sum(Gender) ranges from 5 (all 1s, male) to 10 (all 2s, female)%
if sum(Gender) >= 8
    disp('woman')   % note: sprintf('woman'); with a semicolon prints nothing
else
    disp('man')
end
The Facebook.csv file is given below. Its 1st column is Music, the 2nd is Movies, the 3rd is Books, the 4th is Sports, and the 5th is People.
2;7;17;28;30
1;8;17;28;30
2;10;23;28;30
2;11;22;28;30
1;7;21;28;30
2;9;18;28;30
1;7;19;28;30
3;12;24;29;31
4;14;27;29;31
4;16;27;29;31
6;13;25;29;31
6;14;26;29;31
5;16;27;29;31
5;12;26;29;31
UPDATE
I changed MuMoBSP and the data sheet (see at the top) as hbaderts suggested.
MuMoBSP =
1 1
2 1
3 2
4 2
5 2
6 2
7 1
8 1
9 1
10 1
11 1
12 2
13 2
14 2
15 2
16 2
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 2
25 2
26 2
27 2
28 1
29 2
30 1
31 2
I tried to use the kmeans function, but I think I made some mistakes.
[idx,C] = kmeans(Data,2);
figure;
plot(Data(idx==1,1),Data(idx==1,2),'r.','MarkerSize',20)
hold on
plot(Data(idx==2,1),Data(idx==2,2),'b.','MarkerSize',20)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
figure; silhouette(Data,idx)   % new figure, so the cluster plot is not overwritten
[Figure: cluster assignment plot, shown before the silhouette plot]
Why are the points so far from the centroids? How can I fix that?

Theory
You are probably looking for k-means clustering. The idea is quite simple: we estimate a "prototype" male and a "prototype" female. If a data point (person) is closer to the male prototype, it is classified as male; if it is closer to the female prototype, it is classified as female. We do this using the following algorithm:
1. Choose k (in your case: 2) random initial centroids. These two centroids are our "prototypes" of a male and a female: the average female is the centroid of the "female" cluster, and the average male is the centroid of the "male" cluster.
2. For each data point, find the nearest centroid. If a data point is nearer to centroid 1, we assign e.g. the label "Male"; if it is nearer to centroid 2, we assign the label "Female". So far, this assignment is essentially random - now we have to iteratively fit our clusters to the data we have.
3. For both clusters, calculate the new mean over all assigned data points, i.e. the mean music genre, the mean movie genre, and so on for the "Male" and "Female" clusters. This new mean is a better approximation of the real underlying cluster mean.
4. Repeat steps 2 and 3: reassign the data points to the corrected centroids (some points that were previously "female" will now be assigned to "male" and vice versa, while others stay the same), then recompute the centroids. Iterate until the assignments no longer change, which means we have found a solution.
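For intuition, here is a minimal hand-rolled sketch of these steps, assuming Data is an n-by-p matrix with one person per row (in practice you would just call the built-in kmeans, which also handles edge cases such as empty clusters):
k = 2;
n = size(Data, 1);
C = Data(randperm(n, k), :);      % step 1: pick k random rows as initial centroids
idx = zeros(n, 1);
changed = true;
while changed
    % step 2: assign each point to its nearest centroid
    D = pdist2(Data, C);          % n-by-k matrix of point-to-centroid distances
    [~, newIdx] = min(D, [], 2);
    changed = any(newIdx ~= idx);
    idx = newIdx;
    % step 3: recompute each centroid as the mean of its assigned points
    for j = 1:k
        C(j, :) = mean(Data(idx == j, :), 1);
    end
end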
Implementation in MATLAB
In Matlab, there is a kmeans function, which makes this as simple as calling
idx = kmeans(Data, 2);
Of course, Matlab doesn't know about "male" or "female", so there is only cluster 1 and cluster 2, and it will be your job to judge which one is male and which is female. I assume the one who likes Sci-Fi movies and books, watches sports, and follows women's profiles will be the man ;-)
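One way to make that judgement, as a rough sketch, is to ask kmeans for the centroids (its second output) and inspect them against your coding; the maleCluster variable below is something you set by hand after looking at the centroids:
[idx, C] = kmeans(Data, 2);
disp(C)                                  % each row of C is one "prototype" person
maleCluster = 1;                         % set by hand after inspecting C
labels = repmat({'female'}, size(idx));  % cell array of per-person labels
labels(idx == maleCluster) = {'male'};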
But wait...
Let's look at the music preference: in your code, you say the following:
1: Hip Hop (Male)
2: Pop (Female)
3: Jazz (Female)
4: Metal (Male)
5: Blues (Female)
6: Rock (Female)
If those 6 preferences are equally likely, an average man will have a "music value" of (1+4)/2 = 2.5, and an average woman will have a "music value" of (2+3+5+6)/4 = 4. So somebody who likes Pop music (value 2) is closer to the male prototype and will be classified as male, even though we don't want that!
Why does that happen? For k-means clustering, we want inputs where a low value corresponds to cluster 1 and a high value corresponds to cluster 2 (or the other way around; that doesn't matter). The important thing is that we need inputs which allow us to calculate meaningful "average persons".
If you can say, that Hip Hop is "a bit manly" and Metal is "very manly" music, while Rock is "a bit feminine", Blues is "more feminine", Jazz is "even more feminine" and Pop is "very feminine", you could change the labels to
1: Metal
2: Hip Hop
3: Rock
4: Blues
5: Jazz
6: Pop
Then somebody with a very high value listens to "feminine" music, while somebody with a low value listens to "manly" music. An average man will have a value of 1.5, and an average woman will have a value of 4.5.
If that is not the case (as is probably true in your case), you can for example create a bunch of new input variables:
x_1: Person likes Hip-Hop
x_2: Person likes Pop
...
where each variable is either 0 (false) or 1 (true). Instead of having 5 input variables, you will have e.g. 31 input variables, each either 0 or 1. This comes with the advantage that you can use continuous values too: somebody who likes two pop bands and three metal bands can get a 0.4 for Pop and a 0.6 for Metal.
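A sketch of that re-coding for the data in this question, assuming Data is the matrix read from Facebook.csv and the 31 codes are those defined in MuMoBSP (the code ranges of the five columns don't overlap, so each row's five codes index five distinct indicator columns):
nCodes = 31;                     % total number of category codes across all fields
X = zeros(size(Data, 1), nCodes);
for i = 1:size(Data, 1)
    X(i, Data(i, :)) = 1;        % set the indicators for this person's five codes
end
idx = kmeans(X, 2);              % cluster on the binary features instead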

Related

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
Where some variables are binary, some are categorical, some numerical. Would it be possible to extract from it a correlation matrix, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the econometrics toolbox, but I come across errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note that you are introducing order into your categorical variables if you simply transform them to numbers. If you, for example, transform the JOB values 'TEA', 'SER', 'OTH' to 1, 2, 3, you are making the variable ordinal: 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
This returns a matrix of size nrows-by-numel(unique(T.SCHOOL)).
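Putting the pieces together, a minimal sketch of the double() route, assuming your table is T and using the column names from the question (adapt the list to your actual non-numeric columns):
vars = {'SCHOOL','SEX','ADDRESS','STATUS','JOB','GUARDIAN'};  % non-numeric columns
for v = vars
    T.(v{1}) = double(categorical(T.(v{1})));  % arbitrary numeric codes, see caveat above
end
R = corrcoef(T{:, :});                         % T{:,:} extracts the now all-double table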
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)
I think you need to make all the data numeric, i.e. change/code the non-numerical columns to, for example:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.

Calculating group means with own group excluded in MATLAB

To be generic, the issue is: I need to create group means that exclude own-group observations before calculating the mean.
As an example: let's say I have firms, products, and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of product i of firm f, using all products of all firms but excluding firm f's own product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m, multiplied by the width column of x (second line of code), gives the sum of widths over the other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Dividing by the respective count of observations in the other groups (third line) gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
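Applied to the example data, a quick self-contained check (firm 1's value matches the 25 derived above; firms 2 and 3 follow the same rule):
firm = [1,1,1,2,2,2,3,3];
prod = [1,2,3,1,2,4,2,4];
hp   = [30,10,20,25,15,40,10,35];
x = [firm' prod' hp'];
m = bsxfun(@ne, x(:,1).', unique(x(:,1)));   % 3-by-8: true where the firm differs
result = (m * x(:,3)) ./ sum(m, 2);          % [25; 21; 23.3333]
result(x(:,1))                               % per row: 25 25 25 21 21 21 23.333 23.333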

Should the output of backpropagation converge to 1 given the output is in (0,1)

I am currently trying to understand the ANN that I created for an assignment, which takes grayscale (0-150) images (120x128) and determines whether the person is male or female. It works for the most part. I am treating this like a boolean problem where the output is Male = 1, Female = 0. I am able to get the ANN to correctly identify male or female. However, the outputs I am getting for the males are 0.3-0.6, depending on the run. Should I be getting a value of ~1?
I am using a sigmoid unit, 1/(1+e^-y), and have tried taking the inverse. I have tried this using 5-60 hidden units in one layer, and tried 2 outputs with flip-flop results. I want to understand this so that I can apply it to a non-boolean problem, i.e. if I want a numerical output, how would I go about doing that, or am I using the wrong machine-learning technique?
You can use a binary decision function at the output with some threshold. Assuming you have assigned 0 to female and 1 to male in training, at test time you will get values between 0 and 1, and sometimes even below 0 or above 1. So to make a decision, apply a threshold of 0.5 to the output value: if it is less than 0.5, the estimated class is female; if it is equal to or greater than 0.5, the estimated class is male.
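In Matlab terms, a two-line sketch of that decision rule, where net stands for the output unit's weighted input (a hypothetical variable name):
y = 1 ./ (1 + exp(-net));   % sigmoid output in (0,1)
isMale = (y >= 0.5);        % threshold at 0.5: true = male, false = female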

What would be a good neural net model for pick 3 lotteries?

This is just for fun to see if neural network predictions increase my odds of getting pick 3 lotteries correct.
Right now I just have a simple model of 30 input units, 30 hidden units, and 30 output units.
30 because, if the pick-3 result was something like 124, I would make it so that all my inputs are 0s except input[1] = 1 (because I assign 0 to 9 for the first digit), input[12] = 1 (because I assign 10 to 19 for the middle digit), and input[24] = 1 (because I assign 20 to 29 for the last digit). I do that so that my inputs can store the placement of the digits.
I am training it so that if I enter the inputs for one draw, it gives me the outputs for the next draw.
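As a sketch of that encoding in Matlab (the indices in the question are 0-based, while Matlab arrays are 1-based, hence the +1):
draw = [1 2 4];                  % the pick-3 result "124"
inputVec = zeros(1, 30);         % slots 1-10: first digit, 11-20: middle, 21-30: last
for pos = 1:3
    inputVec(10*(pos-1) + draw(pos) + 1) = 1;
end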
Do you know of a better model (if you have had experience with neural networks that dealt with pick3 lotteries)?

Nearest Neighbour Classifier for multiple features

I have a dataset set that looks like this:
Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Class
Obj 2 2 2 8 5 1
Obj 2 8 3 3 4 2
Obj 1 7 4 4 8 1
Obj 4 3 5 9 7 2
The rows contain objects, each of which has a number of features. I have put 5 features for demonstration purposes, but there are approximately 50 features per object, with the final column being the class label for each object.
I want to create and run the nearest-neighbour classifier algorithm on this data set and retrieve the error rate. I have managed to get the NN algorithm working for each individual feature; a short pseudocode example is below. For each feature, I loop through each object, assigning object j according to its nearest neighbours.
for i = 1:Number of features
    for j = 1:Number of objects
        distance between data(j,i) and the values of feature i
        order by shortest distance
        sum of the class labels of the k shortest distances
        assign the class with the largest number of labels
    end
    error = mean(labels ~= assigned)
end
The issue I have is how I would work out the 1-NN algorithm for multiple features. I will have a selection of the features from my dataset, say features 1, 2, and 3, and I want to calculate the error rate if I add feature 5 to my set of selected features, using 1-NN. Would I find the nearest value to my selected feature among all of features 1-3?
For example, for my data set above:
Adding feature 5: for object 1 of feature 5, the closest number is object 4 of feature 3. As this has a class label of 2, I would assign object 1 of feature 5 the class 2. This is obviously a misclassification, but I would continue to classify all the other objects in feature 5 and compare the assigned and actual values.
Is this the correct way to perform 1-NN against multiple features?
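For reference, the usual way to run 1-NN over several features at once is to treat each object as a point in the space of the selected feature columns and compute one distance per pair of objects, rather than one distance per feature. A leave-one-out sketch, with data as the feature matrix and labels as the class column (hypothetical names; pdist2 is from the Statistics toolbox):
selected = [1 2 3 5];                    % feature columns currently in the set
n = size(data, 1);
assigned = zeros(n, 1);
for j = 1:n
    others = setdiff(1:n, j);            % leave object j out of its own search
    d = pdist2(data(j, selected), data(others, selected));  % one distance per object
    [~, nn] = min(d);                    % index of the single nearest neighbour
    assigned(j) = labels(others(nn));
end
errorRate = mean(assigned ~= labels);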