How to handle imbalanced data while using LSTM - neural-network

I am working on a project for an online signature verification system using RNNs (LSTM), and I am facing a problem when using the signatures as LSTM training data. I am using the SVC 2004 dataset, which has 40 signatures per user: 20 genuine and 20 forged. These signatures are not all of equal length. Some have 130 samples (rows), some have 150; the number of columns is the same, but the number of rows differs from signature to signature. Both are signatures of the same person, and I have to use both of them as training data. Each row is crucial, so I cannot downsample the data, and upsampling can distort the time dependency. How can I adjust for this mismatch in the signature lengths? If anyone helps me solve this problem I will be really grateful. Thank you.
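One common way to handle sequences of unequal length, without resampling them, is to pad the shorter signatures to a common length and let the model ignore the padded timesteps (via masking or an explicit length vector). A minimal MATLAB sketch of the padding step, assuming each signature is stored as a [timesteps x features] matrix in a cell array called signatures (the names here are illustrative):

% Pad each variable-length signature with trailing zero rows so that all
% sequences share the same number of timesteps.
numFeatures = size(signatures{1}, 2);
seqLengths  = cellfun(@(s) size(s, 1), signatures);
maxLen      = max(seqLengths);
padded = zeros(maxLen, numFeatures, numel(signatures));
for i = 1:numel(signatures)
    padded(1:seqLengths(i), :, i) = signatures{i};   % original rows
    % rows seqLengths(i)+1 .. maxLen stay zero (padding); keep seqLengths so
    % the padded timesteps can be masked out or ignored during training
end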

Predict Test Sample Response for SVM Regression

I am running tests on data samples using the SVM regression example in this MathWorks documentation (link: https://uk.mathworks.com/help/stats/compactregressionsvm.predict.html#buvytaz). In that example, the training data has the same number of rows as the data to predict, and so far this seems to be required to run the prediction. What can I do if my data has a different number of rows? How can I train my support vector machine with data that has a different number of samples and still be able to predict, even at the cost of a possibly larger error?
Here is a sample of the training data for the model and of the data that I want to predict, for Mdl = fitrsvm:
ans = 10×2 table

    Training data    Data to predict
    _____________    _______________
         14               9.4833
         27               28.938
         10                7.765
         28
         22               21.054
         29               31.484
       24.5               30.306
       18.5
         32               28.225
         28
Step-by-step verification of what I wanted to do. What I did was:
1. Built a model.
2. Tested it with YFit.
3. Modified the table, and it worked.
4. Doubled the size of the table to predict, and it also worked.
So I had simply done something wrong before.
You can't train your model on unlabelled data, i.e. rows that have no "predict" value. I would suggest you simply filter out all the unlabelled data points and train the model on this subset.
Intuitively, this just reflects the fact that you cannot learn from those data points. If I want to learn the relationship between age -> income, it does not help me at all to ask someone JUST their age and not their income. That information is useless for answering my question.
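A minimal MATLAB sketch of that filtering step, assuming your predictors are in a matrix X, the partially missing response is in a vector Y, and Xnew holds the rows you want predictions for (all three names are illustrative):

labelled = ~isnan(Y);                        % rows that actually have a "predict" value
Mdl  = fitrsvm(X(labelled, :), Y(labelled)); % train only on the labelled subset
YFit = predict(Mdl, Xnew);                   % Xnew can have any number of rows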

Is Matlab incorrect for mnrfit?

It seems Matlab is giving incorrect results for multinomial logistic regression.
In their example documentation using Fisher's Iris dataset [link], they give coefficients for the model which can be used on the same data set itself to get the modeled probabilities.
load fisheriris
sp = categorical(species);
[B,dev,stats] = mnrfit(meas,sp);   % fit the multinomial logistic regression
PHAT = mnrval(B,meas);             % fitted class probabilities on the same data
However, none of the expected-value aggregates match the population aggregates, which is a requirement for a MaxEnt classifier (see slide 35 [here], or Eq. 14 [here], or Agresti, "Categorical Data Analysis", p. 298, etc.).
For example:
>> sum(PHAT)
   49.9828   49.8715   50.1456
These should all equal 50 (the population counts), and likewise for other aggregations.
If the parameters
B = [ 36.9450   42.6378
      12.2641    2.4653
      14.4401    6.6809
     -30.5885   -9.4294
     -39.3232  -18.2862]
were used instead, then all aggregated sufficient statistics match.
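For reference, a quick way to check that claim is to evaluate these coefficients on the same data (a sketch; it reuses the B and meas variables from above):

PHAT_alt = mnrval(B, meas);   % probabilities under the alternative coefficients
sum(PHAT_alt)                 % should be approximately [50 50 50]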
Additionally, it seems odd that Matlab is solving this via the likelihood, which can produce a warning,
Warning: Maximum likelihood estimation did not converge. Iteration
limit exceeded. You may need to merge categories to increase observed
counts
whereas the only requirement, which follows from consideration of the MLE, is that the expected values match; no likelihood evaluation is needed.
It would also be a nice feature if, instead of the true classes, one could supply just the aggregate information as an option.
I submitted a technical error report through the MathWorks website. Their reply:
Hello [----],
I am writing in reference to your Technical Support Case #01820504
regarding 'mnrfit'.
Thanks a lot for your patience and reporting this issue. This appears
to be unexpected behavior. It appears to be related to an existing
issue we have in our records, that "mnrfit" does not give correct
maximum likelihood estimates in certain cases. Since the "mnrfit"
function is not finding the maximum likelihood estimates for the
coefficients, we calculated the actual MLEs. When we use these
estimates, we get the desired result of all 50s in this case.
The issue is that, for this particular dataset in our example, the
classes can be separated perfectly. This means that the logistic
function, in order to get exact zero or one probabilities, needs to
have infinite coefficients. The "mnrfit" function carries out an
iterative procedure with the coefficients getting larger, but it stops
at a point where the results have the issue that you have found. We
certainly agree that "mnrfit" could be made to do better. Our
development team is working on it.
At this stage, I am not able to suggest a workaround other than to
write a custom implementation as my colleague and I had tried. For
now, I will be closing this request as I have already forwarded it to
our records. However, if you have any additional questions related to
this case, please do not hesitate to reach me.
Sincerely,
[----]
MathWorks Technical Support Department

Newbie to Neural Networks

Just starting to play around with neural networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel add-in called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into even chunks, say 9 x 1000 (or however many), created a network for each one, and then tested each of these 9 networks on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (who are not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether the score was above or below 600? Would I then just make the output 0 (below 600) or 1 (above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1 (below 600), 0 (exactly 600), 1 (above 600) - would this work?
5c. Would the network always output -1, 0, 1, or would it output fractions that I would then have to round up or round down to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the activation function? Will log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
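For what it's worth, one simple way to approximate that idea in MATLAB is to bin the output scores and draw a stratified holdout split over the bins; this is only a sketch, and it assumes your SAT scores are stored in a vector called scores:

% Stratified train/test split by score band, so every band of the output
% is represented in the training data.
edges    = 200:50:800;                          % 50-point score bands
bands    = discretize(scores, edges);           % band index for each student
cv       = cvpartition(bands, 'HoldOut', 0.2);  % stratified 80/20 split
trainIdx = training(cv);                        % logical index of training rows
testIdx  = test(cv);                            % logical index of held-out rows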
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that normalizing is not strictly necessary as input to a neural network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you simply change the scale of the coefficient that will be found, by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
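That said, if you do decide to standardize the inputs (for instance to sidestep the numeric-precision concern above), a one-line MATLAB sketch, assuming your 21 inputs are the columns of a matrix X (the name is illustrative):

Xz = (X - mean(X)) ./ std(X);   % column-wise z-scores (equivalently, zscore(X) from the Statistics Toolbox)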
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
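If you do want to experiment with a banded target anyway, the mapping itself is just arithmetic; a sketch with illustrative 50-point bands (not necessarily the exact numbering in your example), where score is a vector of raw SAT scores:

% Map raw scores in 200-800 to coarse band labels 1-12 (50-point bands).
band = floor((score - 200) / 50) + 1;
band = min(band, 12);   % put a perfect 800 into the top band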
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Weka classification; cross-validation across predefined topic

I am using Weka to classify a dataset. Each data point is in one of five topics that I am trying to generalize across.
I would like to make each topic a test set so that I can train on topics 1-4 and test on topic 5, then train on topics 1, 3, 4 and 5, and test on 2, and so on.
Is there a way to direct Weka to perform this automatically, in one run, on one dataset? That is, can I direct Weka to cross-validate by topic?
I apologize for redundancy if this question has already been asked. If it indeed has, any help in directing me towards the answer would be most appreciated.
Thanks!
There are a few ways that I can think of that may assist in getting the results that you desire:
As you have outlined in your question, you could generate 5 different training sets, each with the remaining topic as the testing set. Each model would need to be trained individually if you were going to use the Weka interface (supply the training data, then build a classifier and supply a testing set; repeat). This would likely be quickest if it's a one-off.
You may be able to use the FilteredClassifier with the filter of RemoveWithValues. This may be able to remove the training cases of a particular topic if the topic number is an available attribute (I am guessing that this data is not part of the model's data though, so attribute filtering may also be required if using this approach).
If you are willing to use Java to program a solution, you would be able to manipulate the data and build each of the five classifiers in one go. I am thinking that the algorithm for such a model would be as outlined below. If you plan to undertake this process a lot, it may be the better solution.
Algorithm:
for each topic t
    training_data = all cases not containing topic t
    testing_data  = all cases containing topic t
    build classifier using training_data
    evaluate classifier on testing_data
    save classifier
end for

How to use KNN to classify data in MATLAB?

I'm having problems understanding how K-NN classification works in MATLAB.
Here's the problem: I have a large dataset (65 features for over 1500 subjects) and the respective class labels (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and then classify it via K-NN.
First of all, what's the best ratio for dividing the data into the 3 subgroups (1/3 of the dataset each)?
I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats = 60; ninds = 1000;
trainRatio = 0.8; valRatio = 0.1; testRatio = 0.1;
kmax = 100; % for instance...
data  = randi(100, nfeats, ninds);   % random features (one column per subject)
class = randi(2, 1, ninds);          % random class labels (1 or 2)
[trainInd, valInd, testInd] = dividerand(ninds, trainRatio, valRatio, testRatio);
train = data(:, trainInd);
test  = data(:, testInd);
val   = data(:, valInd);
train_class = class(:, trainInd);
test_class  = class(:, testInd);
val_class   = class(:, valInd);
precisionmax = 0;
koptimal = 0;
for know = 1:kmax
    % is it the same thing to use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class', know);
    mdl = fitcknn(train', train_class', 'NumNeighbors', know);
    label = predict(mdl, val');
    consistency = sum(label == val_class') / length(val_class);
    if consistency > precisionmax
        precisionmax = consistency;
        koptimal = know;
    end
end
mdl_final = fitcknn(train', train_class', 'NumNeighbors', koptimal);   % use the best k
label_final = predict(mdl_final, test');
consistency_final = sum(label_final == test_class') / length(test_class);
Thank you very much for all your help
For your 1st question, "what's the best ratio to divide the 3 subgroups", there are only rules of thumb:
The amount of training data is the most important thing: the more, the better. So make the training set as big as possible, and definitely bigger than the test or validation sets.
Test and validation data serve a similar function, so it is convenient to assign them the same amount of data. But it is important that each has enough data for you to recognize overfitting, and both should be picked fully at random from the available data.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
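For example, with the dividerand call you already use in your code, a 60/20/20 split would just be (a sketch reusing your variable names):

[trainInd, valInd, testInd] = dividerand(ninds, 0.6, 0.2, 0.2);   % 60/20/20 instead of 80/10/10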
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate its "success" by applying the trained model to unseen data - the validation set.
If you stopped all efforts to improve accuracy at that point, you indeed wouldn't need three partitions of your data. But typically, you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ... and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate; if not satisfied, go to 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you are improving your topology so intelligently, but because you are unconsciously learning the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your model is estimated on really unseen data: the test set. This is done only once, and it is also useful for revealing overfitting to the validation data. You are not allowed to start a second, even bigger loop now, because that would let your model adapt to the test set as well!