cross validation process - matlab

I am working on a voice morphing system. I have source speech signal (divided into test, train & validation) and target speech signal (divided into test, train and validation data). Now I'am designing a radial basis neural network with 3 fold cross validation to find the morphed speech wavelet coefficients. I need to initialize the net with source and target training data and perform 3 fold cross validation using training and validation samples.
I think that as per the cross validation I need to divide my data set into 3 parts and then use 2 of them for training and the other for testing. (Repeating the process for all the folds). Now the problem is that I want to know that weather I need to divide my source training data into 3 parts or the target training...??
Thus I need to know how to apply the cross validation? Can anyone please elaborate the process for me?

You should randomly divide your entire data (tuples of input ["source"] and output ["target"/"morphed"] observations) into 3 sets: training, cross-validation, and test.
The training set will be used to train each neural network you try. The cross-validation set will be used after each network is trained, in order to select the best parameters (# of hidden nodes, etc, etc). The test set is used at the end to validate the overall performance (i.e. accuracy, generalization, etc.) of the final model.

Related

Testing of neural network in pattern recognition

I've implemented a neural network for the pattern recognition.
From the two classes of images I've used SIFT feature + BOvW to make the image descriptor.
After training and validation I've got confusion matrix with accuracy 80% overall.
here I am having two doubts
(I've done it in MATLAB GUI)
1)when ever I am training the network it gives different values (confusion matrix) in that case how can I validate the network? (MATLAB documentation says its due to different initial conditions how can I set one then?, I am not changing division of images)
2) how can I use this network to tell the category of the image if I input one, after training and validation

Concept of validate for neural network

I have a problem with concept of Validation for NN. suppose I have 100 set of input variables (for example 8 input, X1,...,X8) and want to predict one Target(Y). now I have two ways to use NN:
1- use 70 set of data for training NN and then use trained NN to predict other 30 sets of Target for validation and then plot output VS Target for this 30 sets as validation plot.
2- use 100 sets of data for training NN and then divide all outputs to two part (70% and 30%). plot 70% of outputs VS corresponding Targets as Training plot. then plot other 30% outputs VS their corresponding Targets as validation plot
Which one is correct??
Also, what the difference between checking NN with new data set and validation data set??
Thanks
You cannot use data for validation, if it has been already used for the training, because the trained NN will already "know" your validation examples. The result of such validation will be very biased. I would for sure use the first way.

why we need cross validation in multiSVM method for image classification?

I am new to image classification, currently working on SVM(support Vector Machine) method for classifying four groups of images by multisvm function, my algorithm every time the training and testing data are randomly selected and the performance is varies at every time. Some one suggested to do cross validation i did not understand why we need cross validation and what is the main purpose of this? . My actual data set consist training matrix size 28×40000 and testing matrix size 17×40000. how to do cross validation by this data set help me. thanks in advance .
Cross validation is used to select your model. The out-of-sample error can be estimated from your validation error. As a result, you would like to select the model with the least validation error. Here the model refers to the features you want to use, and of more importance, the gamma and C in your SVM. After cross validation, you will use the selected gamma and C with the least average validation error to train the whole training data.
You may also need to estimate the performance of your features and parameters to avoid both high-bias and high-variance. Whether your model suffers underfitting or overfitting can be observed from both in-sample-error and validation error.
Ideally 10-fold is often used for cross validation.
I'm not familiar with multiSVM but you may want to check out libSVM, it is a popular, free SVM library with support for a number of different programming languages.
Here they describe cross validation briefly. It is a way to avoid over-fitting the model by breaking up the training data into sub groups. In this way you can find a model (defined by a set of parameters) which fits both sub groups optimally.
For example, in the following picture they plot the validation accuracy contours for parameterized gamma and C values which are used to define the model. From this contour plot you can tell that the heuristically optimal values (from those tested) are those that give an accuracy closer to 84 instead of 81.
Refer to this link for more detailed information on cross-validation.
You always need to cross-validate your experiments in order to guarantee a correct scientific approach. For instance, if you don't cross-validate, the results you read (such as accuracy) might be highly biased by your test set. In an extreme case, your training step might have been very weak (in terms of fitting data) and your test step might have been very good. This applies to ALL machine learning and optimization experiments, not only SVMs.
To avoid such problems just divide your initial dataset in two (for instance), then train in the first set and test in the second, and repeat the process invesely, training in the second and testing in the first. This will guarantee that any biases to the data are visible to you. As someone suggested, you can perform this with even further division: 10-fold cross-validation, means dividing your data set in 10 parts, then training in 9 and testing in 1, then repeating the process until you have tested in all parts.

Accuracy of Neural network Output-Matlab ANN Toolbox

I'm relatively new to Matlab ANN Toolbox. I am training the NN with pattern recognition and target matrix of 3x8670 containing 1s and 0s, using one hidden layer, 40 neurons and the rest with default settings. When I get the simulated output for new set of inputs, then the values are around 0 and 1. I then arrange them in descending order and choose a fixed number(which is known to me) out of 8670 observations to be 1 and rest to be zero.
Every time I run the program, the first row of the simulated output always has close to 100% accuracy and the following rows dont exhibit the same kind of accuracy.
Is there a logical explanation in general? I understand that answering this query conclusively might require the understanding of program and problem, but its made of of several functions to clearly explain. Can I make some changes in the training to get consistence output?
If you have any suggestions please share it with me.
Thanks,
Nishant
Your problem statement is not clear for me. For example, what you mean by: "I then arrange them in descending order and choose a fixed number ..."
As I understand, you did not get appropriate output from your NN as compared to the real target. I mean, your output from NN is difference than target. If so, there are different possibilities which should be considered:
How do you divide training/test/validation sets for training phase? The most division should be assigned to training (around 75%) and rest for test/validation.
How is your training data set? Can it support most scenarios as you expected? If your trained data set is not somewhat similar to your test data sets (e.g., you have some new records/samples in the test data set which had not (near) appear in the training phase, it explains as 'outlier' and NN cannot work efficiently with these types of samples, so you need clustering approach not NN classification approach), your results from NN is out-of-range and NN cannot provide ideal accuracy as you need. NN is good for those data set training, where there is no very difference between training and test data sets. Otherwise, NN is not appropriate.
Sometimes you have an appropriate training data set, but the problem is training itself. In this condition, you need other types of NN, because feed-forward NNs such as MLP cannot work with compacted and not well-separated regions of data very well. You need strong function approximation such as RBF and SVM.

Matlab neural network testing

I have created a neural network and the performance is good. By using nprtool, we are allow to test the network with an input data and target data. Here is my question, what is the purpose of testing a neural network with target data provided? Isn't it testing should not hav e target data so that we can know how well can the trained neural network perform without target data is given? Hope someone will respond to this, thanks =)
I'm not familiar with nprtool, but I suspect it would give the input data to your neural network, and then compare your NN's output data with the target data (and compute some kind of success rate based on that).
So your NN will never see the target data, it's just used to measure the performance.
It's like the "teacher's edition" of the exercise books in school. The student (i.e. the NN) doesn't have the solutions, but her/his answers will be compared against them by the teacher (i.e. nprtool). (Okay, the teacher probably/hopefully knows the subject, but you get the idea.)
The "target" data t is the desired y of y=net(x) used as example to train the network.
What nprtool do is to divide the training set into three groups: the training set, the validation set and the test set.
The first one is used to actually update the network.
The second one is used to determine the performances of the net (note: this set is NOT used in any way to update the network): as the NN "learns" the error (as difference between the t and net(x)) over the validation set decreases. The trend will eventually stop or even reverse: this phenomena is called "overfitting", which means the NN is now chasing the training set, "memorizing" it at the cost of the ability to generalize (meaning: to perform well with unseen data). So the purpose of this validation set is to determine when to stop the training before the NN starts overfitting. This should answer your question.
Finally third set is for external testing, to leave you a set of data untouched by the training procedure.
Even though the total data set [training, validation and testing] are inputs to the training algorithm, the testing data is in no way used to design (i.e., train and validate) the net
total = design + test
design = train + validate
The training data is used to estimate weights and biases
The validation data is used to monitor the design performance on nontraining data. REGARDLESS OF THE PERFORMANCE ON TRAINING DATA, if validation performance degrades continuously for 6 (default) epochs, training is terminated (VALIDATION STOPPING).
This mitigates the dreaded phenomenon of OVERTRAINING AN OVERFIT NET where performance on nontraining data degrades even if the training set performance is improving.
An overfit net has more unknown weights and biases than training equations, thereby allowing an infinite number of solutions. A simple example of overfitting with two unknowns but only one equation:
KNOWN: a, b, c
FIND: unique x1 and x2
USING: a * x1 + b * x2 = c
Hope this helps.
Greg