Training dataset, validation dataset, testing dataset in Matlab - matlab

I am very new to Matlab and to neural networks. I have a 4*81 input dataset and a 1*81 output/target dataset. 'divideblock' or 'dividerand' splits the dataset into training, validation and testing subsets. My question is: after training and simulation, how can I trace which individual samples (training, validation, testing) were used to train the network,
so that I can find the error on the testing and validation inputs individually?
Thanks in advance for any suggestions.

Use the index vectors trainInd, valInd and testInd returned by dividerand, where Q is the number of samples (81 in your case):
[trainInd,valInd,testInd] = dividerand(Q,trainRatio,valRatio,testRatio);
See http://www.mathworks.com/help/toolbox/nnet/ref/dividerand.html.
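The idea behind the index vectors can be sketched outside Matlab as well. The following Python snippet is only an illustration of the principle, not the toolbox's algorithm: `divide_rand` and `subset_mse` are hypothetical helpers, the targets and outputs are made-up numbers, and indices are 0-based here whereas Matlab's are 1-based.

```python
import random

def divide_rand(q, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=0):
    """Randomly split sample indices 0..q-1 into three disjoint index lists."""
    total = train_ratio + val_ratio + test_ratio
    indices = list(range(q))
    random.Random(seed).shuffle(indices)
    n_train = int(round(q * train_ratio / total))
    n_val = int(round(q * val_ratio / total))
    train_ind = sorted(indices[:n_train])
    val_ind = sorted(indices[n_train:n_train + n_val])
    test_ind = sorted(indices[n_train + n_val:])  # whatever is left over
    return train_ind, val_ind, test_ind

def subset_mse(targets, outputs, ind):
    """Mean squared error restricted to the samples listed in ind."""
    return sum((targets[i] - outputs[i]) ** 2 for i in ind) / len(ind)

# Hypothetical targets and network outputs for 81 samples.
targets = [float(i) for i in range(81)]
outputs = [t + 0.5 for t in targets]

train_ind, val_ind, test_ind = divide_rand(81)
for name, ind in (("train", train_ind), ("val", val_ind), ("test", test_ind)):
    print(name, len(ind), subset_mse(targets, outputs, ind))
```

Once the indices are known, the per-subset error is just the error computed over the corresponding columns of the target and output arrays.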

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so, I use the Titanic dataset to train a GLM that predicts the 'Fare' column in that dataset.
I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The training code was as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
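As a rough way to check for exact collinearity before fitting, one can compare the rank of the feature matrix with its number of columns. The sketch below is plain Python and does not use Spark; `matrix_rank`, the absolute tolerance `tol`, and the small matrix `X` are all assumptions made for illustration.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of row lists) via Gaussian elimination with partial pivoting."""
    m = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest absolute entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Hypothetical feature matrix: third column = first column + second column.
X = [
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [3.0, 5.0, 8.0],
    [4.0, 0.0, 4.0],
]
n_features = len(X[0])
rank = matrix_rank(X)
print("rank", rank, "of", n_features)
print("collinear" if rank < n_features else "full column rank")
```

A rank lower than the number of columns means at least one feature is an exact linear combination of the others. Near-collinearity (a tiny but nonzero eigenvalue) would need a different diagnostic, such as variance inflation factors.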
The GeneralizedLinearRegressionModel docs note that a summary may not be available for a model.
However, you can do an initial check to avoid the error:
glm_fit.hasSummary, which in PySpark is a public boolean property (so it is accessed without parentheses).
Use it like this:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the Pyspark source code
and the GeneralizedLinearRegressionTrainingSummary class source code and where the error is thrown
Make sure the input values for your one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I passed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
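The effect described here can be reproduced with a simplified stand-in for the encoder. The sketch below is not Spark's implementation; it assumes an encoder that infers the number of categories as the largest index plus one and drops the last category, which is my understanding of the default behavior. The helper names are hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list, optionally dropping the last category."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1  # the dropped last category stays all zeros
    return vec

def encode_column(indices, drop_last=True):
    """Infer the category count as max index + 1, then encode every value."""
    num_categories = max(indices) + 1
    return [one_hot(i, num_categories, drop_last) for i in indices]

def zero_columns(rows):
    """Return the positions of columns that are 0 in every row."""
    return [j for j in range(len(rows[0])) if all(r[j] == 0 for r in rows)]

quarters = [1, 2, 3, 4, 1, 2, 3, 4]
raw = encode_column(quarters)                       # indices start at 1
shifted = encode_column([q - 1 for q in quarters])  # indices start at 0

print(len(raw[0]), zero_columns(raw))
print(len(shifted[0]), zero_columns(shifted))
```

With indices starting at 1, position 0 is never used, so the encoded vectors carry a column that is always zero. A constant zero column makes the feature matrix rank deficient, which is consistent with the non-invertibility explanation given above.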

Partitioning dataset into train, test and validate subset (Matlab)

I'm new to Matlab and I'm trying to achieve the following:
I have one dataset of 7000+ entries. The goal is to train a classification tree (fitctree) on this data. I've separated the data into a matrix with observations (predictors) and a matrix with classes (class). To partition the data I'm using cvpartition. Everything works fine up to this point.
Problem: I want to create three subsets of the data: one training set, one validation set and one testing set. I want to train the tree using the training set and validate its performance using the validation set. After tweaking the parameters, I want to run the final test on the test data partition.
To partition the data I tried creating a cvpartition, which works, e.g.
cvpart = cvpartition(class, 'k', 10);
and then performing another cvpartition on that testing set, separating it into another two sets:
cvpart2 = cvpartition(cvpart.TestSize, 'k', 10);
Sadly, when validating the performance of the tree, this doesn't seem to work. When I skip the second cvpartition and validate the performance on the test set of cvpart, the model performs perfectly.
Update: after several days I found that it seems to work when used this way:
cvpart2 = cvpartition(cvpart.TrainSize, 'k', 10);
Can anyone explain why it works this way, but not when using the test set?
I hope you can help me out ;)
Kind regards.

how to compare the expected values with actual values using matlab's neural network toolbox?

I have been using Matlab's neural network toolbox lately for my research. I created some neural networks using the fitting tool of this toolbox. Now I have encountered a problem while testing the network with new data. Basically, I want to test the network with new data that does not have the same number of samples as the data used to create the network. For example, I created the network with a 12x36 input matrix (12 variables, 36 samples) and a 1x36 output vector (1 variable, 36 samples). Now I want to get the results for new data of size 12x11520 (12 variables, 11520 samples). However, when I feed this data into the network, I only get a 1x36 output vector (1 variable, 36 samples), while I was expecting a 1x11520 result (1 variable, 11520 samples). I am using the line below to get the results from the neural network, which is named network.
output_estimate = network(input');
I also tried the line below to get the results but the result did not change.
output_estimate = sim(network,input');
Could you please help me understand this and get an output with the same number of samples as the input, so that I can compare the expected and actual results for the new data?
Thank you very much in advance.
Irem

Matlab: How can I store the output of "fitcecoc" in a database

In the Matlab help section, there's a very helpful example for solving classification problems under "Digit Classification Using HOG Features". You can easily execute the full script by clicking on 'Open this example'. However, I'm wondering if there's a way to store the output of "fitcecoc" in a database so that you don't have to retrain and reclassify every time you run the code. Here is the section of the code that's relevant to my question:
% fitcecoc uses SVM learners and a 'One-vs-One' encoding scheme.
classifier = fitcecoc(trainingFeatures, trainingLabels);
So, all I want to do is store 'classifier' in a database and retrieve it for the following code:
predictedLabels = predict(classifier, testFeatures);
Look at Database Toolbox in Matlab.
You could just save the classifier variable in a file:
save('classifier.mat','classifier')
And then load it before executing predict:
load('classifier.mat')
predictedLabels = predict(classifier, testFeatures);

Pybrain outputs same result for any input

I am trying to train a simple neural network with Pybrain. After training I want to confirm that the nn is working as intended, so I activate the same data that I used to train it with. However every activation outputs the same result. Am I misunderstanding a basic concept about neural networks or is this by design?
I have tried altering the number of hidden nodes, the hiddenclass type, the bias, the learningrate, the number of training epochs and the momentum to no avail.
This is my code...
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# 2 inputs, 3 hidden nodes, 1 output
net = buildNetwork(2, 3, 1)
net.randomize()

ds = SupervisedDataSet(2, 1)
ds.addSample([77, 78], 77)
ds.addSample([78, 76], 76)
ds.addSample([76, 76], 75)

trainer = BackpropTrainer(net, ds)
for epoch in range(0, 1000):
    error = trainer.train()
    if error < 0.001:
        break

print(net.activate([77, 78]))
print(net.activate([78, 76]))
print(net.activate([76, 76]))
This is an example of what the results can be... As you can see the output is the same even though the activation inputs are different.
[ 75.99893007]
[ 75.99893007]
[ 75.99893007]
I had a similar problem. I was able to improve the accuracy (i.e. get a different answer for each input) by doing the following:
Normalizing/standardizing the input and output of the neural network.
Doing this allows the neural network to determine its internal weights more accurately in order to predict the answers.
Here's an article that explains it in more detail: http://visualstudiomagazine.com/articles/2014/01/01/how-to-standardize-data-for-neural-networks.aspx
In the end I solved this by normalizing the data between 0 and 1 and also training until the error rate hit 0.00001. It takes much longer to train, but I do get accurate results now.
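For reference, scaling to the range 0 to 1 could look roughly like the sketch below. It is plain Python with no PyBrain calls; `fit_min_max`, `scale` and `unscale` are hypothetical helper names, and the sample values are taken from the question. The scaled pairs would be passed to `ds.addSample`, and the network's output would be mapped back with `unscale`.

```python
def fit_min_max(values):
    """Return the (lo, hi) bounds used to scale values into [0, 1]."""
    return min(values), max(values)

def scale(value, bounds):
    """Map value from [lo, hi] to [0, 1]; assumes hi > lo."""
    lo, hi = bounds
    return (value - lo) / (hi - lo)

def unscale(value, bounds):
    """Map a value in [0, 1] back to the original [lo, hi] range."""
    lo, hi = bounds
    return value * (hi - lo) + lo

# Training samples from the question: (inputs, target).
samples = [([77, 78], 77), ([78, 76], 76), ([76, 76], 75)]

in_bounds = fit_min_max([x for inputs, _ in samples for x in inputs])
out_bounds = fit_min_max([target for _, target in samples])

scaled = [
    ([scale(x, in_bounds) for x in inputs], scale(target, out_bounds))
    for inputs, target in samples
]
for inputs, target in scaled:
    print(inputs, target)

# A network output of 0.5 in scaled units corresponds to this original value.
print(unscale(0.5, out_bounds))
```

The same bounds must be reused for any new input at prediction time; refitting them on new data would shift the scale the network was trained on.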