Training a neural network for user recognition in MATLAB

I'm working on a gait recognition problem; the aim of this study is user authentication.
I have data from 36 users.
I've successfully extracted 143 features for each sample (or example), giving 36 rows and 143 columns for each user.
(In other words, I have 36 examples, and 143 features are extracted from each example. Thus a 36 x 143 matrix called All_Feat has been created for each individual user.)
By the way, each column corresponds to one extracted feature and each row corresponds to one sample (example).
Then, for each user, I divided the matrix (All_Feat) into two matrices: a training matrix and a test matrix.
The training matrix contains 25 rows (examples) and 143 columns, while the testing matrix has 11 rows and 143 columns.
I'm new to classification and this kind of thing.
I'd like to use machine learning (a neural network) to classify these features.
Therefore, as a first step I need to create a reference template for each user (the training phase).
This can be done by training the classifier with the user's features (data) together with those of the remaining users (the other 35 users are considered imposters).
Based on what I have read, training a neural network requires two classes: the first class contains all the training data of the genuine user (e.g. User1), labelled 1, while the second class has the training data of the imposters, labelled 0 (binary classification: 1 for the authorised user and 0 for imposters).
**Now my questions are:**
1- I don't know how to create these classes!
2- For example, if I want to train a neural network for User1, I have these variables: input and target. What should I assign to them?
Should input = the training matrix of User1 plus the training matrices of User2, User3, ..., User36?
Target = what should I assign to this matrix?
I'd really appreciate any help!

Try this: https://es.mathworks.com/help/nnet/gs/classify-patterns-with-a-neural-network.html
A few notes:
You said that, for each user, you have extracted 143 features. It sounds like you have only one repetition for each user (i.e. the user has used the system once). I don't know the source of your data, but I doubt that it is free of randomness. You mention gait analysis, which suggests that the recorded data of a given user will be different each time that user uses the system. In other words: the user uses your system, you capture the data, and you extract the 143 features (numbers); then the user uses the system again, and the extracted 143 features will be slightly different. Therefore, you should get several examples for each user to train the classifier. In terms of a MATLAB matrix, your matrix should have one COLUMN for each example and 143 rows (one per feature). Since you should have several repetitions for each user (for example 10), your big matrix should be something like 143 rows x 360 columns.
You should "create" one new neural network for each user. Given a user (for example User4), you create a dataset (a new matrix) with samples of that user and samples of several other users (User1, User3, User5...). You do a binary classification (cases: "user4" against "other users"). After training, it would be advisable to test the classifier with data of other users whose data was not present during the training phase (for example User2 and others). Since you are doing a binary classification, your matrices should be something like the following:
Example, you have 10 trials (examples) of each user. You want to create a neural network to detect the user User1. The matrix should be like:
(notation cU1_t1 means: column with features of user 1, trial 1)
input_matrix = [cU1_t1, cU1_t2, ..., cU1_t10, cU2_t1, ..., cU36_t10]
The target matrix should be like:
target = a matrix whose first 10 columns are [1; 0] and whose other 350 columns are [0; 1]. That means that the first 10 columns are of type A and the others of type B. In this case "type A" means "User1" and "type B" means "Not User1".
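As a rough sketch of this layout (not code from the original poster), the input and target matrices for one user could be assembled as below. Random numbers stand in for real gait features, and `numTrials`, `userLabels` and `genuineUser` are names made up for the illustration.

```matlab
% Illustrative sizes taken from the text: 36 users, 10 trials each, 143 features.
numUsers    = 36;
numTrials   = 10;     % assumed number of repetitions per user
numFeatures = 143;

% Random placeholder data: one COLUMN per example, users stored one after another.
input_matrix = rand(numFeatures, numUsers * numTrials);      % 143 x 360

% userLabels(j) is the user who produced column j: [1 1 ... 1 2 2 ... 36].
userLabels = reshape(repmat(1:numUsers, numTrials, 1), 1, []);

% Two-row target for the network of one chosen user:
% row 1 marks "this user", row 2 marks "not this user".
genuineUser = 1;
isGenuine   = (userLabels == genuineUser);
target      = double([isGenuine; ~isGenuine]);                % 2 x 360
```

Changing `genuineUser` to another number gives the target for that user's network without rebuilding `input_matrix`.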
Then, you should segment the data (train data, validation data, test data) to train the neural network and so on. Remember to save some users just for the testing phase; for example, the train matrix should not contain any of the columns of five users: user2, user6, user7, user10, user20 (50 columns).
I think you get the idea.
Regards.
************ UPDATE: ******************************
This example assumes that the user selects/indicates their name and then the system uses the neural network to authenticate the user (like a password). I will give you a small example with random numbers.
Let's say you have recorded data from 15 users (but in the future you will have more). You record "gait data" from them when they do something with your recording device. From the recorded signals you extract some features, let's say 5 features (5 numbers). Hence, every time a user uses the machine you get 5 numbers. Even if the user is the same, the 5 numbers will be different each time, because the recorded signals have some randomness. Therefore, to train the neural network you need several examples of each user. Let's say that you have 18 repetitions performed by each user.
To sum up this example:
There are 15 users available for the experiment.
Each time the user uses the system you record 5 numbers (features). You get a feature vector. In matlab it will be a COLUMN.
For the experiment each user has performed 18 repetitions.
Now you have to create one neural network for each user. To that end, you have to construct several matrices.
Let's say you want to create the neural network (NN) of user 2 (U2). The NN will classify the feature vectors into 2 classes: U2 and NotU2. Therefore, you have to train and test the NN with examples of both. The group NotU2 represents any other user that is not U2; however, you should NOT train the NN with data of every other user that you have in your experiment. That would be cheating (remember that you can't have data from every user in the world). Therefore, to create the train dataset you will exclude all the repetitions of some users and use them to test the NN during training (validation dataset) and after training (test dataset). For this example we will use users {U1,U3,U4} for validation, and users {U5,U6,U7} for testing.
Therefore you construct the following matrices:
Train input matrix
It will have 12 examples of U2 (70% more or less) and every example of users {U8,U9,...,U14,U15}. Each example is a column; hence, the train matrix will be a matrix of 5 rows and 156 columns (12+8*18). I will order it as follows: [U2_ex1, U2_ex2, ..., U2_ex12, U8_ex1, U8_ex2, ..., U8_ex18, U9_ex1, ..., U15_ex1,...U15_ex18], where U2_ex1 represents a column vector with the 5 features obtained from User 2 during repetition/example number 1.
-- Target matrix of train matrix. It is a matrix of 2 rows and 156 columns. Each column j represents the correct class of example j. The column is formed by zeros, with a 1 at the row that indicates the class. Since we have only 2 classes, the matrix has only 2 rows. I will say that class U2 is the first one (hence the column vector for each example of this class will be [1 0]), and the other class (NotU2) is the second one (hence the column vector for each example of this class will be [0 1]). Obviously, the columns of this matrix have the same order as those of the train matrix. So, according to the order that I have used, the target matrix will be:
12 columns [1 0] and 144 columns [0 1].
Validation input matrix
It will have 3 examples of U2 (15% more or less) and every example of users [U1,U3,U4]. Hence, this will be a matrix of 5 rows and 57 columns (3+3*18).
-- Target matrix of validation matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
Test input matrix
It will have the remaining 3 examples of U2 (15%) and every example of users [U5,U6,U7]. Hence, this will be a matrix of 5 rows and 57 columns (3+3*18).
-- Target matrix of test matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
IMPORTANT. The columns of each matrix should have a random order to improve the training. That is, do not put all the examples of U2 together and then the others. For this example I have put them in order for clarity. Obviously, if you change the order of the input matrix, you have to use the same order in the target matrix.
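A minimal sketch of that shuffling, using made-up names (`trainInput`, `trainTarget`) and random placeholder data: a single permutation from `randperm` is applied to both matrices, so each column keeps its own label.

```matlab
% Placeholder training set in the ordered layout: 12 genuine + 144 impostor columns.
trainInput  = rand(5, 156);
trainTarget = [repmat([1; 0], 1, 12), repmat([0; 1], 1, 144)];

% Draw ONE random column order and apply it to both matrices.
perm           = randperm(size(trainInput, 2));
shuffledInput  = trainInput(:, perm);
shuffledTarget = trainTarget(:, perm);
```

Shuffling each subset (train, validation, test) on its own, rather than the joined matrix, keeps every example inside the block of columns it was assigned to.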
To use MATLAB you will have to pass two matrices: the inputMatrix and the targetMatrix. The inputMatrix will have the train, validation and test input matrices joined, and the targetMatrix the same with the targets. So, the inputMatrix will be a matrix of 5 rows and 270 columns. The targetMatrix will have 2 rows and 270 columns. For clarity I will say that the first 156 columns are the training ones, then come the 57 columns of validation, and finally the 57 columns of testing.
The MATLAB commands will be:
% Create a pattern recognition network
hiddenLayerSize = 10; % You can play with this number
net = patternnet(hiddenLayerSize);
% Specify the column indices of each subset
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:156;   % 156 training columns
net.divideParam.valInd = 157:213;   % 57 validation columns
net.divideParam.testInd = 214:270;  % 57 test columns
% Train the network
[net,tr] = train(net, inputMatrix, targetMatrix);
In the window that opens you will be able to see the performance of your neural network. The output object "net" is your trained neural network. You can use it with new data if you want.
Repeat this process for each other user (U1, U3, ...U15) to obtain his/her neural network.
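The matrices described in this example can be put together as in the following sketch. It uses random numbers in place of real features, and the helper names (`allData`, `makeTarget`, `trainUsers`, ...) are invented for the illustration; only the sizes and the user groups come from the example. The columns are left in order for clarity.

```matlab
numUsers = 15; numReps = 18; numFeatures = 5;

% allData{u} holds the examples of user u, one column per repetition
% (random placeholder values).
allData = cell(1, numUsers);
for u = 1:numUsers
    allData{u} = rand(numFeatures, numReps);
end

genuine    = 2;          % build the data set for the network of U2
trainUsers = 8:15;       % impostors used for training
valUsers   = [1 3 4];    % impostors used for validation
testUsers  = 5:7;        % impostors used for testing

% Split the 18 genuine repetitions 12 / 3 / 3.
genTrain = allData{genuine}(:, 1:12);
genVal   = allData{genuine}(:, 13:15);
genTest  = allData{genuine}(:, 16:18);

% Genuine columns first, then every column of the impostor group.
trainInput  = [genTrain, allData{trainUsers}];     % 5 x 156
valInput    = [genVal,   allData{valUsers}];       % 5 x 57
testInput   = [genTest,  allData{testUsers}];      % 5 x 57
inputMatrix = [trainInput, valInput, testInput];   % 5 x 270

% Targets: [1; 0] for U2 and [0; 1] for NotU2, in the same column order.
makeTarget = @(nGen, nTotal) [ones(1, nGen),  zeros(1, nTotal - nGen); ...
                              zeros(1, nGen), ones(1, nTotal - nGen)];
targetMatrix = [makeTarget(12, 156), makeTarget(3, 57), makeTarget(3, 57)];   % 2 x 270
```

With this layout the first 156 columns are the training ones, the next 57 are for validation and the last 57 are for testing. Changing `genuine` and the three user groups gives the data set for another user's network.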

Related

Calculate False Acceptance and False Rejection Rates for a biometric system (MATLAB)

I have a cell array called all_output (the output of training and testing a neural network for each of 36 users). It contains 36 cells, and each cell has 504 values (as shown in the attached image).
The content of each cell of all_output is shown in the attached image.
**Update**
I will explain how all_output has been constructed.
After the neural network has been trained, I used this code to test it:
% % % Test the Network %%%%%%%
outputs = net(Testing_Gen{i});
all_output{1,i}=outputs
Testing_Gen is a cell array of size 1 x 36 (as shown in the attached image).
To understand the content of Testing_Gen:
for each user, I have 14 test samples (examples), and for each sample 143 features have been extracted and stored in a column.
Each cell in Testing_Gen contains the user's test samples and the imposters' test samples (as shown in the attached image).
As we can see, one cell is 143 rows x 504 columns: the first 14 columns in each cell are the genuine user's samples and the remaining columns are the imposters' samples (490 samples [14*35]).
For example, I have extracted 14 samples for User1 to be used for testing; therefore, the first cell contains the test samples of User1 (14) and the imposters' samples as well (490 samples [14*35]), in order to calculate the FAR and FRR.
I'd like to calculate the False Acceptance Rate (FAR), False Rejection Rate (FRR) and Equal Error Rate (EER) for this Matrix.
False Acceptance Rate is the percentage in which the system incorrectly accepts an imposter as the legitimate user.
For example, to calculate the FAR for User1 all the imposter's samples (which are already stored in (all_output) matrix) need to be tested against User1 and repeat this procedure 36 times.
False Rejection Rate displays the percentage in which the authorised user is wrongly rejected by the system.
For example, to calculate the FRR for User1 all his testing samples (which are already stored in (all_output) matrix) need to be tested against User1 and repeat this procedure for each genuine user (36 times).
EER simply can be calculated using this equation (FAR+FRR)/2
When calculating the EER, the results should reflect the need for a balance between FRR and FAR (in other words, the values of FAR and FRR should be as close to each other as possible, as my system aims to balance accepting authorised users and rejecting imposters).
This is the code that I have done so far to calculate FRR
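%%% Performance: calculate FAR, FRR, EER
% FRR
numUsers = 36;          % number of users
numGenuine = 14;        % the first 14 columns of each cell are the genuine samples
thresholds = 0:0.01:1;  % threshold values
FRR = zeros(numUsers, numel(thresholds));
for n = 1:numUsers
    genuineScores = all_output{1,n}(1, 1:numGenuine);
    for k = 1:numel(thresholds)
        % a genuine sample is falsely rejected when its score is below the threshold
        FRR(n,k) = sum(genuineScores < thresholds(k)) / numGenuine;
    end
end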
I am not sure what your question is, but I cannot agree with your claim:
EER simply can be calculated using this equation (FAR+FRR)/2
FAR (like FRR) is not a single value; it is a function of the threshold. EER is the value at which the FAR curve and the FRR curve intersect, as can be seen here.
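To illustrate the point, here is a minimal sketch for a single user with 14 genuine and 490 impostor scores, as in the question. The score values are synthetic, the variable names are my own, and taking the EER at the threshold where the two curves are closest is one common approximation rather than the only possible definition.

```matlab
% Synthetic network outputs for one user (made-up values, illustration only).
genuineScores  = linspace(0.55, 0.99, 14);    % 14 genuine test samples
impostorScores = linspace(0.01, 0.65, 490);   % 490 impostor test samples

thresholds = 0:0.01:1;
FAR = zeros(size(thresholds));
FRR = zeros(size(thresholds));
for k = 1:numel(thresholds)
    t = thresholds(k);
    FAR(k) = mean(impostorScores >= t);   % impostors wrongly accepted
    FRR(k) = mean(genuineScores  <  t);   % genuine samples wrongly rejected
end

% Approximate the crossing point of the two curves on the threshold grid.
[~, idx]     = min(abs(FAR - FRR));
EER          = (FAR(idx) + FRR(idx)) / 2;
eerThreshold = thresholds(idx);
```

Averaging FAR and FRR only gives the EER at this crossing threshold; at any other threshold the average is just another operating point. Plotting `FAR` and `FRR` against `thresholds` shows the intersection directly.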

cross validation function crossvalind

I have a question concerning cross-validation. As I understand it, cross-validation is used to find the best parameters,
but I did not understand the role of the function "crossvalind" (generate cross-validation indices): it just takes a data set, without any model, as in this example:
load fisheriris
[g gn] = grp2idx(species);
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/3);
crossvalind() function splits your data in two groups: the training set and the cross-validation set.
By your example:
[trainIdx testIdx] = crossvalind('HoldOut', size(species,1), 1/3); means split the data in species (2/3 in the training set and 1/3 in the cross-validation set).
Supposing that your data is like:
species=[datarow1;datarow2;datarow3;datarow4;datarow5;datarow6] then
trainIdx would be like [1;1;0;1;1;0] and testIdx would be like [0;0;1;0;0;1] meaning that from the 6 total elements in our set crossvalind function assigned 4 to the train set and 2 to the cross-validation set. Of course this is a random assignment meaning that the zero and ones indices will vary every time you call the function but the proportion between them will be fixed and trainIdx + testIdx will always be ones(size(species,1),1)
crossvalind('LeaveMOut',size(species,1),2) would be exactly the same as crossvalind('HoldOut', size(species,1), 1/3) in this particular case. In the 'HoldOut' format you provide a parameter P which takes values from 0 to 1 (like 1/3 in the example above), while with the option 'LeaveMOut' you provide an integer M, like 2 samples out of the 6 in total, or 2000 samples out of the 10000 samples in your dataset. In the case of 'Resubstitution', crossvalind('Resubstitution', size(species,1), [1/3,2/3]) would again be the same, but here you also have the option of, let's say, [1/3,3/4], meaning that some samples can be in both the train and cross-validation sets, or even [1,1], which means that all the samples are used in both sets (trainIdx=testIdx=[1;1;1;1;1;1] in the above example). I strongly suggest typing help crossvalind and taking a look at the help file, which is always a lot more detailed and helpful than I could ever be.
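To make the idea of the index vectors concrete, here is a small sketch that imitates a hold-out split with base MATLAB only. It is not how crossvalind is implemented; `N`, `P` and `numTest` are illustrative names, and the rounding rule is an assumption.

```matlab
N = 6;        % number of samples, as in the 6-row example above
P = 1/3;      % proportion held out for testing

% Pick round(P*N) samples at random for the test set (rounding rule assumed).
numTest = round(P * N);
perm    = randperm(N);

testIdx = false(N, 1);
testIdx(perm(1:numTest)) = true;   % logical mask of the held-out samples
trainIdx = ~testIdx;               % every other sample is used for training
```

The masks can then be used for indexing, e.g. `data(trainIdx, :)` and `data(testIdx, :)` for a hypothetical `data` matrix with one row per sample. No model is involved at this stage; the function only decides which samples go where.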

Creating Target Values for Training Data - Neural Networks

I've been given some bacteria data and I'm supposed to use neural networks to classify the bacteria as belonging to Group A or Group B.
The bacteria dataset I've been given looks like this. There are 18 .mat Matlab datasets which are as follows: A1.mat, A2.mat, A3.mat, A4.mat, A5.mat, A6.mat, A7.mat, A8.mat, A9.mat, B1.mat, B2.mat, B3.mat, B4.mat, B5.mat, B6.mat, B7.mat, B8.mat, B9.mat.
Each of these Matlab dataset consists of a 2510 x 2 matrix. The first column is the time information and the second column is some bacteria information. I extracted only the bacteria information in column 2 between indices 900 and 1200. That was the portion I needed for my analysis. This yielded a 209 x 1 matrix.
I went on to create my input data as a 209 x 18 matrix, i.e., extracting data between indices 900 and 1200 for each of the datasets and putting everything together.
My goal in this project is to classify bacteria as belonging to Group A or Group B. From this point on, I'm at a loss on how to get the target values I need to feed into the neural network. Do I need additional information in order to proceed? That is, should the dataset have also contained target information as well? Any help at this point would be helpful. Thanks.
It sounds like you have 418 total exemplars, each with 9 features, with 209 belonging to Group A and 209 belonging to group B. For what it's worth, you'd typically want to have many, many more exemplars to train a neural network.
Instead of thinking of your classification problem as A or B, think about it as 'A' or 'not A.' So exemplars belonging to Group A have a target value of 1, and exemplars belonging to group B have a target value of 0.

How to implement data I have to svmtrain() function in MATLAB?

I have to write a script using MATLAB which will classify my data.
My data consists of 1051 web pages (rows) and 11000+ words (columns). I am holding the word occurrences in the matrix for each page. The first 230 rows are about computer science courses (to be labeled with +1) and the remaining 821 are not (to be labeled with -1). I am going to label a small part of these rows (say 30 rows) myself. Then the SVM will label the remaining unlabeled rows.
I have found that I could solve my problem using MATLAB's svmtrain() and svmclassify() methods. First I need to create SVMStruct.
SVMStruct = svmtrain(Training,Group)
Then I need to use
Group = svmclassify(SVMStruct,Sample)
But the point is that I do not know what Training and Group are. For Group, MathWorks says:
Grouping variable, which can be a categorical, numeric, or logical
vector, a cell vector of strings, or a character matrix with each row
representing a class label. Each element of Group specifies the group
of the corresponding row of Training. Group should divide Training
into two groups. Group has the same number of elements as there are
rows in Training. svmtrain treats each NaN, empty string, or
'undefined' in Group as a missing value, and ignores the corresponding
row of Training.
And for Training it is said that:
Matrix of training data, where each row corresponds to an observation
or replicate, and each column corresponds to a feature or variable.
svmtrain treats NaNs or empty strings in Training as missing values
and ignores the corresponding rows of Group.
I want to know how I can adapt my data to Training and Group. I need (at least) a little code sample.
EDIT
What I did not understand is that in order to have SVMStruct I have to run
SVMStruct = svmtrain(Training, Group);
and in order to have Group I have to run
Group = svmclassify(SVMStruct,Sample);
Also, I still did not get what Sample should look like.
I am confused.
Training would be a matrix with 1051 rows (the webpages/training instances) and 11000 columns (the features/words). I'm assuming you want to test for the existence of each word on a webpage? In this case you could make the entry of the matrix a 1 if the word exists for a given webpage and a 0 if not.
You could initialize the matrix with Training = zeros(1051,11000); but filling the entries would be up to you, presumably done with some other code you've written.
Group is a 1-D column vector with one entry for every training instance (webpage) that tells you which of the two classes the webpage belongs to. In your case you would make the first 230 entries a "+1" for computer science and the remaining 821 entries a "-1" for not.
Group = zeros(1051,1); % gives you a matrix of zeros with 1051 rows and 1 column
Group(1:230) = 1; % set first 230 entries to +1
Group(231:end) = -1; % set the rest to -1
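Regarding the EDIT in the question: the Group passed to svmtrain holds the labels you already know, while the Group returned by svmclassify holds the predicted labels for Sample. The sketch below shows one way the pieces could be separated if only 30 rows are labelled by hand. The data is a small random stand-in (50 columns instead of 11000+), and `data`, `labels` and `labeledIdx` are hypothetical names. The two toolbox calls from the question are left commented out so the rest runs without the toolbox.

```matlab
% Toy stand-in for the 1051 x 11000+ word matrix (random values, illustration only).
data   = double(rand(1051, 50) > 0.5);
labels = [ones(230, 1); -ones(821, 1)];    % +1 = computer science, -1 = not

% Hypothetical choice of the 30 hand-labelled rows: 15 from each class.
labeledIdx   = [1:15, 231:245];
unlabeledIdx = setdiff(1:1051, labeledIdx);

Training = data(labeledIdx, :);     % rows with known labels
Group    = labels(labeledIdx);      % their labels, one entry per row of Training
Sample   = data(unlabeledIdx, :);   % rows the SVM should label

% With the toolbox available, the calls from the question would then be:
% SVMStruct      = svmtrain(Training, Group);
% predictedGroup = svmclassify(SVMStruct, Sample);
```

Storing the result in a separate variable such as `predictedGroup` avoids confusing the known labels with the predicted ones.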

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the requested number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6 by 6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing is not worth the effort.)
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
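The percentages quoted above can be re-derived from the table. The sketch below is only that arithmetic; `clusterId` encodes the three pairs read off the table and `numBits` is the string length, both names being illustrative.

```matlab
% Hamming distance table from the Results section.
D = [  0 182 481 495 490 500;
     182   0 479 489 492 488;
     481 479   0 180 497 517;
     495 489 180   0 503 515;
     490 492 497 503   0 174;
     500 488 517 515 174   0];
numBits   = 1000;             % length of each binary string
clusterId = [1 1 2 2 3 3];    % clusters 1-2, 3-4 and 5-6

% Logical masks: same-cluster pairs, and the upper triangle so each pair counts once.
sameCluster = bsxfun(@eq, clusterId', clusterId);
upperTri    = triu(true(6), 1);

withinFrac  = D(sameCluster  & upperTri) / numBits;   % 3 within-cluster pairs
betweenFrac = D(~sameCluster & upperTri) / numBits;   % 12 between-cluster pairs

meanWithin  = mean(withinFrac);    % about 0.18
meanBetween = mean(betweenFrac);   % about 0.50
```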
Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist in MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 combinations, that means 936 out of 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end