Partitioning sample for cross validation within subsets of the sample (by school) (Matlab) - matlab

I need to partition a data set 5 ways for 5 fold cross validation using matlab. However, the dataset pertains to students nested in schools, so instead of splitting the entire data set 5 ways, I want to split each school 5 ways. After building 5 training and 5 test data sets I want to be able to tell which observations in each data set belong to each school.
In the full data set I have a feature vector x (28x15363) and target vector y (1x15363). I also have an index vector (task_indexes) that identifies the first observation for each school.
I would like to build a set of 5 training set indexes and 5 corresponding test set indexes based on a 5 fold partition of each school, and I will need a set of task indexes, 5 tr_indexes which identify the first observation of each school in each training set, and tsk_indexes which do the same for the test set.
The training set indexes tr, test set indexes tst, training set task indexes tr_indexes and test set task indexes tst_indexes should fit into the following code.
gammas = [.1, .01, .001, .0001, .00001]
errgrid = zeros(5,1)
for g = 1:length(gammas)
gammas = gammas(4)
[testerrs,theW,theD] = ...
run_code_example(gammas, x(:, tr), y(tr), x(:, tst), y(tst), tr_indexes, tst_indexes, Dini, 10, 'feat', 0, fname);
errgrid(g) = testerrs
gammaselect = gammas(min(errgrid))
In this code, I use a partition provided by the authors with training set indexes tr and test indexes tst, and task indexes tr_indexes for the training set and tst_indexes for the test set.
Any advice on how to carry out this partition?
I tried to write this post to the standard of writing a good question. If anyone can provide feedback to improve the question it is greatly appreciated.

Related

training neural network for user recognition in MATLAB

I'm working on gait recognition problem, the aim of this study is to be used for user authentication
I have data of 36 users
I've successfully extracted 143 features for each sample (or example) which are (36 rows and 143 columns) for each user
( in other words, I have 36 examples and 143 features are extracted for each example. Thus a matrix called All_Feat of 36*143 has been created for each individual user).
By the way, column represents the number of the extracted features and row represents the number of samples (examples) for each feature.
Then I have divided the data into two parts, training and testing (the training matrix contains 25 rows and 143 columns, while the testing Matrix contains 11 rows and 143 columns).
Then, for each user, I divided the matrix (All_Feat) into two matrixes ( Training matrix, and Test matrix ).
The training matrix contains ( 25 rows (examples) and 143 columns), while the testing matrix has (11 rows and 143 columns).
I'm new to classification and this stuff
I'd like to use machine learning (Neural Network) for classifying these features.
Therefore, the first step I need to create a reference template for each user ( which called training phase)
this can be done by training the classifier with the user's features (data) and the remaining Users as well (35 users are considered as imposters).
Based on what I have read, training Neural Network requires two classes, the first class contains all the training data of genuine user (e.g. User1) and labelled with 1 , while the second class has the training data of imposters labelled as 0 (which is binary classification, 1 for the authorised user and 0 for imposters).
**now my question is: **
1- i dont know how to create these classes!
2- For example, if I want to train Neural Network for User1, I have these variables, input and target. what should I assign to these variables?
should input= Training matrix of User1 and Training matrixes of User2, User3,.....User35 ?
Target=what should i assign to this matrix?
I really appreciate any help!
Try this: https://es.mathworks.com/help/nnet/gs/classify-patterns-with-a-neural-network.html
A few notes:
You said that, for each user, you have extracted 136 features. It sounds like you have only one repetition for each user (i.e. the user has tried-used the system once). However, I don't know the source of your data, but I dunno that it hasn't got some type of randomness. You mention gait analysis, and that sounds like that the recorded data of a given one user will be different each time that user uses the system. In other words: the user uses your system, you capture the data, you extract the 136 features (numbers); then, the user uses again the system, but the extracted 136 features will be slightly different. Therefore, you should get several examples for each user to train the classifier. In terms of "matlab matrix" your matrix should have one COLUMN for each example, and 136 rows (each of you features). Since you should have several repetitions for each user (for example 10 times), your big matrix should be something like: 136 rows x 360 columns.
You should "create" one new neural network for each user. Given a user (for example User4), you create a dataset (a new matrix) with samples of that user, and samples of several other users (User1, User3, User5...). You do a binary classification (cases: "user4" against "other users"). After training, it would be advisable to test the classifier with data of other users whose data was not present during the training phase (for example User2 and others). Since you are doing a binary classification your matrices should be somthing like follows:
Example, you have 10 trials (examples) of each user. You want to create a neural network to detect the user User1. The matrix should be like:
(notation cU1_t1 means: column with features of user 1, trial 1)
input_matrix = [cU1_t1; cU1_t2; ...; cU1_t10; cU2_t1; ...; cU36_t10]
The target matrix should be like:
target = a matrix whose 10 first columns are [ 1, 0], and the other 350 columns are [0, 1]. That means that the first 10 columns are of type A, and the others of type B. In this case "type A" means "User1", and "type B" means "Not User1".
Then, you should segment the data (train data, validation data, test data) to train the nerual network and so on. Remember to save some users just for the testing phase, for example, the train matrix should not have any of the columns of five users: user2, user6, user7, user10, user20 (50 columns).
I think you get the idea.
Regards.
************ UPDATE: ******************************
This example assumes that the user selects/indicates its name and then the system uses the neural network to authenticate the user (like a password). I will give you an small example with random numbers.
Let's say you have recorded data from 15 users (but in the future you will have more). You record "gait data" from them when they do something with your recording device. From the recorded signals you extract some features, let's say you extract 5 features (5 numbers). Hence, everytime a user uses the machine you get 5 numbers. Even if user is the same, the 5 numbers will be different each time, because the recorded signals have some randomness. Therefore, to train the neural network you have to have several examples of each user. Let's say that you have 18 repetitions performed by each user.
To sum up this example:
There are 15 users available for the experiment.
Each time the user uses the system you record 5 numbers (features). You get a feature vector. In matlab it will be a COLUMN.
For the experiment each user has performed 18 repetitions.
Now you have to create one neural network for each user. To that end, you have to construct several matrices.
Let's say you want to create the neural network (NN) of user 2 (U2). The NN will classify the feature vectors in 2 classes: U2 and NotU2. Therefore, you have to train and test the NN with examples of this. The group NotU2 represents any other user that it is not U2, however, you should NOT train the NN with data of every other user that you have in your experiment. This will be cheating (think that you can't have data from every user in the world). Therefore, to create the train dataset you will exclude all the repetitions of some users to test the NN during the training (validation dataset) and after the trainning (test dataset). For this example we will use users {U1,U3,U4} for validation, and users {U5,U6,U7} for testing.
Therefore you construct the following matrices:
Train input matrix
It wil have 12 examples of U2 (70% more or less) and every example of users {U8,U9,...,U14,U15}. Each example is a column, hence, the train matrix will be a matrix of 5 rows and 156 columns (12+8*18). I will order it as follows: [U2_ex1, U2_ex2, ..., U2_ex12, U8_ex1, U8_ex2, ..., U8_ex18, U9_ex1, ..., U15_ex1,...U15_ex18]. Where U2_ex1 represents a column vector with the 5 features obtained of User 2 during the repetition/example number 1.
-- Target matrix of train matrix. It is a matrix of 2 rows and 156 columns. Each column j represents the correct class of the example j. The column is formed by zeros, and it has a 1 at the row that indicates the class. Since we have only 2 classes the matrix has only 2 rows. I will say that class U2 will be the first one (hence the column vector for each example of this class will be [1 0]), and the other class (NotU2) will be the second one (hence the column vector for each example of this class will be [0 1]). Obviously, the columns of this matrix have the same order than the train matrix. So, according to the order that I have used, the target matrix will be:
12 columns [1 0] and 144 columns [0 1].
Validation input matrix
It will have 3 examples of U2 (15% more or less) and every example of users [U1,U3,U4]. Hence, this will be a matrix of 6 rows and 57 columns ( 3+3*18).
-- Target matrix of validation matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
Test input matrix
It will have the remaining 3 examples of U2 (15%) and every example of users [U5,U6,U7]. Hence, this will be a matrix of 6 rows and 57 columns (3+3*18).
-- Target matrix of test matrix: A matrix of 2 rows and 57 columns: 3 columns [1 0] and 54 columns [0 1].
IMPORTANT. The columns of each matrix should have a random order to improve the training. That is, do not put all the examples of U2 together and then the others. For this example I have put them in order for clarity. Obviously, if you change the order of the input matrix, you have to use the same order in the target matrix.
To use MATLAB you will have to to pass two matrices: the inputMatrix and the targetMatrix. The inputMatrix will have the train,validation and test input matrices joined. And the targetMatrix the same with the targets. So, the inputMatrix will be a matrix of 6 rows and 270 columns. The targetMatrix will have 2 rows and 270 columns. For clarity I will say that the first 156 columns are the trainning ones, then the 57 columns of validation, and finally 57 columns of testing.
The MATLAB commands will be:
% Create a Pattern Recognition Network
hiddenLayerSize = 10; %You can play with this number
net = patternnet(hiddenLayerSize);
%Specify the indices of each matrix
net.divideFcn = 'divideind';
net.divideParam.trainInd = [1: 156];
net.divideParam.valInd = [157:214];
net.divideParam.testInd = [215:270];
% % Train the Network
[net,tr] = train(net, inputMatrix, targetMatrix);
In the open window you will be able to see the performance of your neural network. The output object "net" is your neural network trained. You can use it with new data if you want.
Repeat this process for each other user (U1, U3, ...U15) to obtain his/her neural network.

How to use sample weights for a random forest classificator in Orange?

I am trying to train a random forest classificator on a very imbalanced dataset with 2 classes (benign-malign).
I have seen and followed the code from a previous question (How to set up and use sample weight in the Orange python package?) and tried to set various higher weights to the minority class data instances, but the classificators that I get work exactly the same.
My code:
data = Orange.data.Table(filename)
st = Orange.classification.tree.SimpleTreeLearner(min_instances=3)
forest = Orange.ensemble.forest.RandomForestLearner(learner=st, trees=40, name="forest")
weight = Orange.feature.Continuous("weight")
weight_id = -10
data.domain.add_meta(weight_id, weight)
data.add_meta_attribute(weight, 1.0)
for inst in data:
if inst[data.domain.class_var]=='malign':
inst[weight]=100
classifier = forest(data, weight_id)
Am I missing something?
Simple tree learner is simple: it's optimized for speed and does not support weights. I guess learning algorithms in Orange that do not support weight should raise an exception if the weight argument is specified.
If you need them just to change the class distribution, multiply data instances instead. Create a new data table and add 100 copies of each instance of malignant tumor.

Increase number of processed samples in Simulink

I'm pretty new to Simulink and I was wondering if the following thing was possible somehow:
I have a signal of let's say 10000 data points.
On this signal I want to run a certain algorithm, however said algorithm needs exactly 1000 samples to work properly.
Now with normal matlab functions this is no problem. You cut the signal in 10 pieces, perform the algorithm for each one, stitch the processed parts back together and you get your result.
In Simulink however this creates sort of a problem, since (to my understanding right now) Simulinks blocks work sample per sample (one sample in, one sample out). So I don't have the necessary data to perform the algortihm within a block.
Is there any way to increase the number of processed samples per block?
Reshape 10,000 data points with 1000 samples and create Column wise data
lets say
data = [1 2 3 4 5 6], convert
data = [1 4
2 5
3 6]
Now, define sample time for lets say 1 sec, use fromworkspace blocks. In this example 2 (No of column), with each workspace block format a array as [t data(:,1)],[t data(:,2)] where t = [1 2 3], multiples of sampletime.
On Simulink Model, set running time equal to 3 sec as there are 3 data points and store output via to workspace block

Interactive Report - aggregate sum of multiple columns from one table multiply by values from another Table

I have a challenge in Oracle Apex - I would to sum multiple columns to give 3 extra rows namely points, Score, %score. There are more columns but I'm only choosing a few for now.
Below is an example structure of my data:
Town | Sector | Outside| Inside |Available|Price
Roy-----Formal----0----------0----------1------0
Kobus --Formal----0 ---------0--------- 1------0
Wika ---Formal----0----------0--------- 1------0
Mevo----Formal----1----------1----------1------0
Hoch----Formal----1----------1----------1------1
Points------------2----------2----------5------1
Score------------10---------10---------10------10
%score-----------20---------20---------50------10
Each column has a constant weighting (which serves as a factor and it can change depending on the areas). In this case, the weighting for the areas are in the first row of the sector Formal:
Sector |Outside| Inside |Available|Price
Formal----1----------1 ----------1-----1
Informal--1----------0 ----------2-----1
I tried using the aggregate sum function in apex but it wont work because I need the factor in the other table. This is where my challenge began.
To compute the rows below the report
points = sum per column * weighting factor per column
Score = sum of no of shops visited (in this case its 5) * weighting factor per column
% score = points/Score * 100
The report should display as described above. With the new computed rows below.
I kindly ask anyone to assist me with this challenge as I have tried searching for solutions but haven't come across any.
Thanks a lot for your support in advance!!

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicate the number of years. 1 row would contain the production values corresponding to the 20 years. Other 99 rows for matrix 2, 3 and 4 are just the different realizations (or simulation runs). So basically the other 99 rows for matrix 2,3 and 4 are repeat cases (but not with exact values because of random numbers).
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly the calculation in MATLAB is pretty strait-forward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do t-test comparing your vectors 2, 3 and 4 against normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT : I misinterpreted your question. See the answer of Yuk and following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals it's not too difficult to guess the standard deviation on your results. This is a measure of the "spread" of your results. Now the error on your mean is calculated as the standard deviation of your results divided by the number of observations. And the confidence interval is calculated by multiplying that standard error with appx. 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05 the further away the mean, the lower the p-value. This can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. If you see your p-values, these chances can be said to be non-existent.
So you see that when the number of values get high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you, is nothing more that the three matrices differ significantly from the mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
Your question and your method aren't really clear :
Is the distribution equal in all columns? This is important, as two distributions can have the same mean, but differ significantly :
is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/number of observations. Saves you quite some work -if the distributions are alike! -
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample kolmogorov-smirnov test for formal testing. But please read in on this test, as you have to understand what it does in order to interprete the results correctly.
On a sidenote : if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, eg. Bonferroni or Dunn-Sidak.