Can we omit the one-hot user feature when training a factorization machine? - recommendation-engine

I have n users and k items in total. Each user has bought some items, and I want to train an FM on the existing user/item dataset so that I can recommend new items to a new user who already owns some items.
What if I arrange my feature matrix X as follows:
items            all items          y
0 ... 1 ... 0    1, 0, ..., 1, 0    1
0 ... 0 ... 1    1, 0, ..., 1, 0    0
That way, when predicting for a new user, I would not need to find the most similar user in the training dataset and reuse that user's recommendations.

In most cases, these sparse features are useful for memorization, especially as first-order features. Of course, you can also use a field embedding layer to convert these one-hot features into dense features before applying other structures such as a DNN or PNN.
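To make the layout from the question concrete, here is a minimal Python sketch, assuming a second-order FM with randomly initialized (untrained) parameters. The helper names `build_features` and `fm_score` and the toy sizes are made up for illustration. The only point is that the input has no user-ID block, so a new user can be scored directly from the items they already own.

```python
import random

def build_features(candidate, owned, n_items):
    """One-hot block for the candidate item, then a multi-hot block of owned items."""
    x = [0.0] * (2 * n_items)
    x[candidate] = 1.0
    for item in owned:
        x[n_items + item] = 1.0
    return x

def fm_score(x, w0, w, v):
    """Second-order FM: w0 + sum_i w_i*x_i + sum_{i<j} <v_i, v_j>*x_i*x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        # Standard reformulation: 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Hypothetical toy setup: 4 items, 3 latent factors, untrained random parameters.
n_items, n_factors = 4, 3
rng = random.Random(0)
w0 = 0.0
w = [rng.uniform(-0.1, 0.1) for _ in range(2 * n_items)]
v = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(2 * n_items)]

# A new user who owns items 0 and 2; score item 1 as a candidate.
x_new = build_features(candidate=1, owned=[0, 2], n_items=n_items)
score = fm_score(x_new, w0, w, v)
print(x_new, score)
```

With trained parameters, one would compute this score for every item the new user does not own yet and recommend the highest-scoring ones.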

Related

Cumulative count of integer occurrences in a MATLAB array, including zero occurrences

I am constructing a classification tree in MATLAB. However, I have too many features, and I want to select the set of features that maximizes my accuracy. To do that, I have a brute-force algorithm that tries all possible subsets of features (e.g. if my data has 4 features, I would try selecting features (1, 2), (1, 3), (1, 4), (2,1)...; (1, 2, 3), (1, 2, 4), etc.). If the accuracy for a given subset of features is above 84%, I want to save the feature numbers that were used.
So what I would like to do is have an array that can count the number of occurrences of each feature number for each feature set that produced an accuracy above 84%. This array would also have to be cumulative over all iterations of feature combinations.
I've seen other posts for counting integer occurrences in Matlab arrays, but for me a) the size of the array to be analyzed changes throughout execution and b) if there are no occurrences of a number, I want the counting array to display 0 in its place.
For example, if I have two feature subsets [1 2 7] and [1 3 6 8] I would like my counting array to be
[2 1 1 0 0 1 1 1 0 0] (assuming I have ten features total)
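The counting described above can be sketched as follows. This is a small Python illustration rather than MATLAB, and `update_counts` is a made-up helper name. The counting array is allocated once with one slot per feature, so features that never occur stay at 0, and it is updated each time a subset passes the accuracy threshold.

```python
def update_counts(counts, subset):
    """Add one to the count of every (1-based) feature number in subset."""
    for feature in subset:
        counts[feature - 1] += 1
    return counts

n_features = 10
counts = [0] * n_features            # one slot per feature, zeros by default
for good_subset in ([1, 2, 7], [1, 3, 6, 8]):
    update_counts(counts, good_subset)
print(counts)  # [2, 1, 1, 0, 0, 1, 1, 1, 0, 0]
```

In MATLAB the same update can be written with indexed assignment, `counts(subset) = counts(subset) + 1`, which is safe here because a feature subset contains no repeated numbers.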

Partitioning sample for cross validation within subsets of the sample (by school) (Matlab)

I need to partition a data set 5 ways for 5 fold cross validation using matlab. However, the dataset pertains to students nested in schools, so instead of splitting the entire data set 5 ways, I want to split each school 5 ways. After building 5 training and 5 test data sets I want to be able to tell which observations in each data set belong to each school.
In the full data set I have a feature vector x (28x15363) and target vector y (1x15363). I also have an index vector (task_indexes) that identifies the first observation for each school.
I would like to build 5 sets of training indexes and 5 corresponding sets of test indexes based on a 5-fold partition of each school. I will also need task indexes for each fold: 5 tr_indexes vectors, which identify the first observation of each school in each training set, and 5 tst_indexes vectors, which do the same for the test sets.
The training set indexes tr, test set indexes tst, training set task indexes tr_indexes and test set task indexes tst_indexes should fit into the following code.
gammas = [.1, .01, .001, .0001, .00001];
errgrid = zeros(length(gammas), 1);
for g = 1:length(gammas)
    % train and test with the current gamma
    [testerrs,theW,theD] = ...
        run_code_example(gammas(g), x(:, tr), y(tr), x(:, tst), y(tst), tr_indexes, tst_indexes, Dini, 10, 'feat', 0, fname);
    errgrid(g) = testerrs;
end
% select the gamma with the lowest test error
[~, best] = min(errgrid);
gammaselect = gammas(best);
In this code, I use a partition provided by the authors with training set indexes tr and test indexes tst, and task indexes tr_indexes for the training set and tst_indexes for the test set.
Any advice on how to carry out this partition?
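One possible way to build such a partition is sketched below in Python with 0-based indexes (add 1 to everything for MATLAB). It assumes that each school's observations are contiguous and that `task_starts` holds the first observation of each school. The function names and the toy sizes are hypothetical. Each school's observations are shuffled and dealt round-robin into 5 folds, so every fold contains roughly one fifth of every school.

```python
import random

def school_folds(task_starts, n_obs, k=5, seed=0):
    """Assign every observation to one of k folds, school by school."""
    rng = random.Random(seed)
    bounds = list(task_starts) + [n_obs]   # school s spans bounds[s]..bounds[s+1]-1
    fold_of = [0] * n_obs
    school_of = [0] * n_obs
    for s in range(len(task_starts)):
        members = list(range(bounds[s], bounds[s + 1]))
        for i in members:
            school_of[i] = s
        rng.shuffle(members)
        for pos, i in enumerate(members):
            fold_of[i] = pos % k           # deal shuffled members round-robin
    return fold_of, school_of

def first_positions(indices, school_of):
    """Positions within the sorted list `indices` where a new school starts."""
    return [p for p, i in enumerate(indices)
            if p == 0 or school_of[i] != school_of[indices[p - 1]]]

# Hypothetical toy data: 3 schools with 10, 15 and 15 observations.
task_starts, n_obs = [0, 10, 25], 40
fold_of, school_of = school_folds(task_starts, n_obs)
folds = []
for f in range(5):
    tst = [i for i in range(n_obs) if fold_of[i] == f]
    tr = [i for i in range(n_obs) if fold_of[i] != f]
    tr_indexes = first_positions(tr, school_of)
    tst_indexes = first_positions(tst, school_of)
    folds.append((tr, tst, tr_indexes, tst_indexes))
```

Because `tr` and `tst` stay sorted, observations of the same school remain contiguous in each set, which is what makes the task indexes meaningful. A school with fewer than 5 observations would be missing from some test sets, so such schools need separate handling.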
I tried to write this post to the standard of writing a good question. If anyone can provide feedback to improve the question it is greatly appreciated.

Clustering data with different approaches

I have the following type of data:
The *.edge file contains the connections between the IDs of different users:
1 23
4 67
...
The *.feat file contains properties of the IDs. The first column (column 0) holds the user IDs. The other columns represent features whose names are listed in another file. For example, user ID 1 does not have the feature in column 1 (0), but user ID 4 does (1):
1: 0 0 1 0 1 1 0 1 1
4: 1 0 1 1 1 0 1 1 1
...
Now I want to cluster the data using different algorithms such as k-means, DBSCAN, hierarchical clustering and so on. But as I have read, there are several problems with multidimensional data?
There are problems with very high-dimensional data, but 10 is not high. You have other problems: k-means needs coordinates to compute means, not a graph with edges. Also, the values should be continuous, not binary. You need to study these methods in more detail. If you say "But as I read ...", then try to give a reference.

training neural network for user recognition in MATLAB

I'm working on a gait recognition problem; the aim of this study is user authentication.
I have data from 36 users.
I've successfully extracted 143 features for each sample (or example), giving 36 rows and 143 columns for each user
(in other words, I have 36 examples, and 143 features are extracted for each example, so a 36*143 matrix called All_Feat has been created for each individual user).
Each column corresponds to one extracted feature, and each row corresponds to one sample (example).
Then, for each user, I divided the matrix (All_Feat) into two matrices (a training matrix and a test matrix).
The training matrix contains 25 rows (examples) and 143 columns, while the testing matrix has 11 rows and 143 columns.
I'm new to classification.
I'd like to use machine learning (a neural network) to classify these features.
The first step is to create a reference template for each user (the training phase).
This can be done by training the classifier with the user's features (data) together with those of the remaining users (the other 35 users are considered impostors).
Based on what I have read, training a neural network requires two classes: the first class contains all the training data of the genuine user (e.g. User1), labelled 1, while the second class contains the training data of the impostors, labelled 0 (binary classification: 1 for the authorised user and 0 for impostors).
**Now my question is:**
1- I don't know how to create these classes!
2- For example, if I want to train a neural network for User1, I have the variables input and target. What should I assign to these variables?
Should input be the training matrix of User1 together with the training matrices of User2, User3,.....User35?
And what should I assign to target?
I would really appreciate any help!
Try this: https://es.mathworks.com/help/nnet/gs/classify-patterns-with-a-neural-network.html
A few notes:
You said that, for each user, you have extracted 143 features. It sounds like you have only one repetition for each user (i.e. the user has used the system once). I don't know the source of your data, but I doubt that it is free of randomness. You mention gait analysis, which suggests that the recorded data of a given user will be different each time that user uses the system. In other words: the user uses your system, you capture the data, and you extract the 143 features (numbers); then the user uses the system again, and the extracted 143 features will be slightly different. Therefore, you should get several examples of each user to train the classifier. In terms of MATLAB matrices, your matrix should have one COLUMN for each example and 143 rows (one per feature). Since you should have several repetitions for each user (for example 10), your big matrix should be something like 143 rows x 360 columns.
You should create one new neural network for each user. Given a user (for example User4), you create a dataset (a new matrix) with samples of that user and samples of several other users (User1, User3, User5...). You do a binary classification ("user4" against "other users"). After training, it is advisable to test the classifier with data from other users whose data was not present during the training phase (for example User2 and others). Since you are doing a binary classification, your matrices should look something like the following:
Example, you have 10 trials (examples) of each user. You want to create a neural network to detect the user User1. The matrix should be like:
(notation cU1_t1 means: column with features of user 1, trial 1)
input_matrix = [cU1_t1, cU1_t2, ..., cU1_t10, cU2_t1, ..., cU36_t10]
The target matrix should be like:
target = a matrix whose first 10 columns are [1, 0] and whose other 350 columns are [0, 1]. That means that the first 10 columns are of type A and the others are of type B. In this case "type A" means "User1" and "type B" means "Not User1".
Then, you should segment the data (train data, validation data, test data) to train the neural network, and so on. Remember to save some users just for the testing phase; for example, the train matrix should not contain any of the columns of five users: user2, user6, user7, user10, user20 (50 columns).
I think you get the idea.
Regards.
************ UPDATE: ******************************
This example assumes that the user selects or indicates his or her name and the system then uses the neural network to authenticate the user (like a password). I will give you a small example with random numbers.
Let's say you have recorded data from 15 users (but in the future you will have more). You record "gait data" from them when they do something with your recording device. From the recorded signals you extract some features; let's say you extract 5 features (5 numbers). Hence, every time a user uses the machine you get 5 numbers. Even if the user is the same, the 5 numbers will be different each time, because the recorded signals have some randomness. Therefore, to train the neural network you need several examples of each user. Let's say that you have 18 repetitions performed by each user.
To sum up this example:
There are 15 users available for the experiment.
Each time the user uses the system you record 5 numbers (features). You get a feature vector. In matlab it will be a COLUMN.
For the experiment each user has performed 18 repetitions.
Now you have to create one neural network for each user. To that end, you have to construct several matrices.
Let's say you want to create the neural network (NN) of user 2 (U2). The NN will classify the feature vectors into 2 classes: U2 and NotU2. Therefore, you have to train and test the NN with examples of both. The group NotU2 represents any user that is not U2; however, you should NOT train the NN with data from every other user in your experiment. That would be cheating (you can't have data from every user in the world). Therefore, when creating the train dataset you exclude all the repetitions of some users, and use them to test the NN during training (validation dataset) and after training (test dataset). For this example we will use users {U1,U3,U4} for validation and users {U5,U6,U7} for testing.
Therefore you construct the following matrices:
Train input matrix
It will have 12 examples of U2 (70% more or less) and every example of users {U8,U9,...,U14,U15}. Each example is a column; hence, the train matrix will be a matrix of 5 rows and 156 columns (12+8*18). I will order it as follows: [U2_ex1, U2_ex2, ..., U2_ex12, U8_ex1, U8_ex2, ..., U8_ex18, U9_ex1, ..., U15_ex1,...U15_ex18], where U2_ex1 represents a column vector with the 5 features obtained from User 2 during repetition/example number 1.
-- Target matrix of train matrix. It is a matrix of 2 rows and 156 columns. Each column j represents the correct class of example j. The column is all zeros except for a 1 at the row that indicates the class. Since we have only 2 classes, the matrix has only 2 rows. I will say that class U2 is the first one (hence the column vector for each example of this class will be [1 0]) and the other class (NotU2) is the second one (hence the column vector for each example of this class will be [0 1]). Obviously, the columns of this matrix have the same order as those of the train matrix. So, according to the order that I have used, the target matrix will be:
12 columns [1 0] and 144 columns [0 1].
Validation input matrix
It will have 3 examples of U2 (15% more or less) and every example of users [U1,U3,U4]. Hence, this will be a matrix of 5 rows and 57 columns (3+3*18).
-- Target matrix of validation matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
Test input matrix
It will have the remaining 3 examples of U2 (15%) and every example of users [U5,U6,U7]. Hence, this will be a matrix of 5 rows and 57 columns (3+3*18).
-- Target matrix of test matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
IMPORTANT. The columns of each matrix should have a random order to improve the training. That is, do not put all the examples of U2 together and then the others. For this example I have put them in order for clarity. Obviously, if you change the order of the input matrix, you have to use the same order in the target matrix.
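The point about keeping both matrices in the same order can be sketched in Python as follows. Here `shuffle_columns` is a made-up helper, each matrix is represented as a plain list of columns, and the toy numbers are arbitrary. A single random permutation is drawn and applied to both the input and the target.

```python
import random

def shuffle_columns(input_cols, target_cols, seed=0):
    """Apply one random permutation to both lists of columns."""
    if len(input_cols) != len(target_cols):
        raise ValueError("input and target need the same number of columns")
    order = list(range(len(input_cols)))
    random.Random(seed).shuffle(order)
    return [input_cols[j] for j in order], [target_cols[j] for j in order]

# Hypothetical toy data: 2 examples of U2 followed by 3 examples of NotU2.
# The first feature is set to 1.0 for U2 only so the pairing is easy to check.
inputs = [[1.0, 0.3], [1.0, 0.7], [0.0, 0.2], [0.0, 0.9], [0.0, 0.5]]
targets = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
shuffled_inputs, shuffled_targets = shuffle_columns(inputs, targets)
```

Shuffling the two matrices with separate random draws would silently mislabel the examples, which is why the permutation is generated once and reused.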
To use MATLAB you will have to pass two matrices: the inputMatrix and the targetMatrix. The inputMatrix will have the train, validation and test input matrices joined, and the targetMatrix the same with the targets. So, the inputMatrix will be a matrix of 5 rows and 270 columns. The targetMatrix will have 2 rows and 270 columns. For clarity I will say that the first 156 columns are the training ones, then come the 57 columns of validation, and finally the 57 columns of testing.
The MATLAB commands will be:
% Create a Pattern Recognition Network
hiddenLayerSize = 10; %You can play with this number
net = patternnet(hiddenLayerSize);
% Specify the column indices of each subset (train, validation, test)
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:156;
net.divideParam.valInd = 157:213;
net.divideParam.testInd = 214:270;
% Train the network
[net,tr] = train(net, inputMatrix, targetMatrix);
In the window that opens you will be able to see the performance of your neural network. The output object "net" is your trained neural network. You can use it with new data if you want.
Repeat this process for each other user (U1, U3, ...U15) to obtain his/her neural network.
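The column bookkeeping in this example is easy to get wrong by one, so here is a short Python sketch that recomputes it. The variable names are hypothetical and the numbers are the ones from the example: 18 repetitions per user, a 12/3/3 split of U2's repetitions, and 8, 3 and 3 other users in the train, validation and test blocks.

```python
n_reps = 18
u2_counts = (12, 3, 3)        # U2 examples in train, validation, test
other_users = (8, 3, 3)       # U8..U15, {U1,U3,U4}, {U5,U6,U7}

# Number of columns in each block
sizes = [u2 + others * n_reps for u2, others in zip(u2_counts, other_users)]

# 1-based, inclusive column ranges in the joined matrix (train, validation, test)
ranges, start = [], 1
for size in sizes:
    ranges.append((start, start + size - 1))
    start += size

# Target columns in the unshuffled order: U2 first, then NotU2, block by block
targets = []
for u2, size in zip(u2_counts, sizes):
    targets += [[1, 0]] * u2 + [[0, 1]] * (size - u2)

print(sizes)   # [156, 57, 57]
print(ranges)  # [(1, 156), (157, 213), (214, 270)]
```

The three ranges must cover columns 1 to 270 without gaps or overlaps, so the validation block ends at column 213 and the test block starts at column 214.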

Find groups with high cross correlation matrix in Matlab

Given a lower triangular matrix (100x100) containing cross-correlation
values, where entry 'ij' is the correlation value between signal 'i'
and 'j' and so a high value means that these two signals belong to
the same class of objects, and knowing there are at most four distinct
classes in the data set, does someone know of a fast and effective way
to classify the data and assign all the signals to the 4 different
classes, rather than search and cross check all the entries against
each other? The following 7x7 matrix may help illustrate
the point:
1 0 0 0 0 0 0
.2 1 0 0 0 0 0
.8 .15 1 0 0 0 0
.9 .17 .8 1 0 0 0
.23 .8 .15 .14 1 0 0
.7 .13 .77 .83 .11 1 0
.1 .21 .19 .11 .17 .16 1
there are three classes in this example:
class 1: rows <1 3 4 6>,
class 2: rows <2 5>,
class 3: rows <7>
This is a good problem for hierarchical clustering. Complete linkage clustering gives you compact clusters; all you have to do is determine the cutoff distance at which two clusters should be considered different.
First, you need to convert the correlation matrix to a dissimilarity matrix. Since correlation is between 0 and 1, 1-correlation will work well: high correlations get a score close to 0, and low correlations get a score close to 1. Assume that the correlations are stored in an array corrMat:
%# remove diagonal elements
corrMat = corrMat - eye(size(corrMat));
%# and convert to a vector (as pdist)
dissimilarity = 1 - corrMat(find(corrMat))';
%# decide on a cutoff
%# remember that 0.4 corresponds to corr of 0.6!
cutoff = 0.5;
%# perform complete linkage clustering
Z = linkage(dissimilarity,'complete');
%# group the data into clusters
%# (cutoff is at a correlation of 0.5)
groups = cluster(Z,'cutoff',cutoff,'criterion','distance')
groups =
2
3
2
2
3
2
1
To confirm that everything is great, you can visualize the dendrogram
dendrogram(Z,0,'colorthreshold',cutoff)
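To see what complete linkage with a distance cutoff does on the 7x7 example from the question, here is a small pure-Python sketch. It is a naive re-implementation for illustration, not what MATLAB's linkage does internally, and the function name is made up. Two clusters are merged only while the largest pairwise dissimilarity between them stays at or below the cutoff.

```python
def complete_linkage_groups(corr, cutoff):
    """Agglomerative clustering on 1 - corr, read from the lower triangle of corr."""
    def dist(i, j):
        hi, lo = max(i, j), min(i, j)
        return 1.0 - corr[hi][lo]

    clusters = [[i] for i in range(len(corr))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: cluster distance is the largest pairwise distance
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break                      # remaining clusters are too far apart
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

corr = [
    [1,   0,   0,   0,   0,   0,   0],
    [.2,  1,   0,   0,   0,   0,   0],
    [.8,  .15, 1,   0,   0,   0,   0],
    [.9,  .17, .8,  1,   0,   0,   0],
    [.23, .8,  .15, .14, 1,   0,   0],
    [.7,  .13, .77, .83, .11, 1,   0],
    [.1,  .21, .19, .11, .17, .16, 1],
]
groups = [[i + 1 for i in c] for c in complete_linkage_groups(corr, cutoff=0.5)]
print(groups)  # [[1, 3, 4, 6], [2, 5], [7]]
```

The result matches the three classes listed in the question. The search over cluster pairs is repeated after every merge, which is fine for 100 signals but would be slow for much larger matrices.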
You can use the following method instead of creating the dissimilarity matrix.
Z = linkage(corrMat,'complete','correlation')
With this call, MATLAB computes correlation distances (one minus the correlation) between the rows of the matrix, and you can then plot the dendrogram as follows:
dendrogram(Z);
One way to verify if your dendrogram is right or not is by checking its maximum height which should correspond to 1-min(corrMat). If the minimum value in corrMat is 0 then the maximum height of your tree should be 1. If the minimum value is -1 (negative correlation), the height should be 2.
Since it is given that there are going to be 4 groups, I'd start with a pretty simplistic two stage approach.
In the first stage you find the maximum correlation among any two elements, place those two elements in a group, then zero out their correlation in the matrix. Repeat, finding the next highest correlation among two elements and either adding those to an existing group or creating a new one until you have the correct number of groups.
Finally, check which elements aren't in a group, go to their column, and identify the highest correlation they have with any other group. If that element is in a group already, place them in that group as well, otherwise skip to the next element and come back to them later.
If there is interest or anything isn't clear I can add code later. Like I said, the approach is simplistic but if you don't need to verify the number of groups I think it should be effective.