I built the following neural network with the help of the Encog library for Java:
network.addLayer(new BasicLayer(DataCooker.DATA_SIZE));
network.addLayer(new BasicLayer(DataCooker.DATA_SIZE));
network.addLayer(new BasicLayer(DataCooker.DATA_SIZE));
network.addLayer(new BasicLayer(DataCooker.DATA_SIZE));
network.addLayer(new BasicLayer(1));
network.getStructure().finalizeStructure();
network.reset();
I also prepared test data and tried to train the network with this trainer:
Train train = new ResilientPropagation(network, trainingSet);
However, I found that in some (rare) cases train.getError() returns Infinity, no matter how many epochs have passed.
The data looks good at first glance (all values are numbers, with no NaN and no Infinity values).
What are the possible reasons for this Infinity error? What can I do to fix it?
Thanks
Sorry, the data was bad after all: the output values for the failing cases were Infinity.
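Since the cause turned out to be non-finite values in the expected outputs, a quick scan of both the inputs and the ideal outputs before training can catch this early. The sketch below is in Python rather than Java and uses made-up data; find_non_finite is a hypothetical helper, not part of Encog.

```python
import math

def find_non_finite(rows):
    """Return (row_index, column_index, value) for every NaN or infinite entry."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((i, j, value))
    return bad

# Made-up example: the inputs look fine, but one ideal output is Infinity.
inputs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ideal = [[1.0], [float("inf")], [0.0]]

bad_inputs = find_non_finite(inputs)
bad_ideal = find_non_finite(ideal)
print("bad inputs:", bad_inputs)
print("bad ideal outputs:", bad_ideal)
```

Checking the ideal outputs as well as the inputs matters here, because in this case only the outputs were affected.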
I made a neural network that I want to use to classify input data (400 characteristics per input) as one of five Arabic dialects. I divide the data into "train data", "validation data" and "test data" with net.divideFcn = 'dividerand';. I use trainbr as the training function, which results in long training because I have 9000 elements in the training data.
For the network architecture I used two layers: the first with 10 perceptrons and the second with 5, because I use a one-vs-all strategy.
Training usually ends with the minimum gradient reached rather than the minimum error.
How can I make the network predict better? Could it be a problem with generalization (the network learns the training data very well, but tends to fail on new data)?
Should I add more perceptrons to the first layer? I'm asking because it takes about an hour to train the network with 10 perceptrons in the first layer, so the time will increase.
This is the code for my network:
[Test] = load('testData.mat');
[Ex] = load('trainData.mat');
Ex.trainVectors = Ex.trainVectors';
Ex.trainLabels = Ex.trainLabels';
net = newff(minmax(Ex.trainVectors),[10 5] ,{'logsig','logsig'},'trainlm','learngdm','sse');
net.performFcn = 'mse';
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.95;
net.trainParam.epochs = 1000;
net.trainParam.goal = 0;
net.trainParam.max_fail = 50;
net.trainFcn = 'trainbr';
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
net = init(net);
net = train(net,Ex.trainVectors,Ex.trainLabels);
Thanks !
Working with neural networks is a kind of creative work, so no one can give you the one true answer. But I can give some advice based on my own experience.
First of all, check the network error when training ends (on the training and validation data sets, before you start using the test data set). You said it is at its minimum, but what is its actual value? If it is 50% too, then we have bad data or the wrong net architecture.
If the error on the training data set is OK, the next step is to check how much the coefficients of your net change at the validation step, and what happens to the error there. If they change dramatically, that is a sign the architecture is wrong: the network is not able to generalize and will retrain on every new data set.
What else can we do before changing the architecture? We can change the number of epochs. Sometimes we get good results partly by chance, so we must be sure the change in the coefficients is small in the final steps of training. As far as I remember, nntool checks this automatically, so maybe we can skip this step.
One more thing I recommend: change the training data set. As you may know, rand always gives the same sequence of numbers after MATLAB starts, so if you create your data sets only once per session you may always be working with the same sets. This problem is also related to non-homogeneous data: some part of your data may be more important than the rest. So if several different random sets give about the same error, the data is OK and we can go further. If not, we need to work on the data and split it more carefully. Sometimes I avoid dividerand and divide the data manually.
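The idea of trying several random splits, and making sure they really differ, can be sketched as follows. This is Python rather than MATLAB; random_split is a hypothetical helper, and the seeds are arbitrary values chosen only for illustration.

```python
import random

def random_split(n_samples, train_ratio, seed):
    """Shuffle indices 0..n_samples-1 with the given seed and cut off a training part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]

n_samples = 9000  # size of the training data mentioned in the question
splits = [random_split(n_samples, 0.7, seed) for seed in (1, 2, 3)]

# The same seed reproduces the same split...
same_again = random_split(n_samples, 0.7, 1)
# ...while different seeds should give different training sets.
train_sets = [frozenset(train) for train, _ in splits]
all_different = len(set(train_sets)) == len(train_sets)
print("reproducible:", same_again == splits[0])
print("all splits different:", all_different)
```

Training once per split and comparing the resulting errors then shows whether the result depends on which samples happened to land in the training set.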
Sometimes I try changing the type of activation function. But here you use perceptrons... So the idea is to try sigmoid or linear neurons instead of perceptrons. This rarely leads to significant improvements, but it can help.
If all these steps don't give you enough, you have to change the net architecture, and the number of neurons in the first layer is the first thing to change. Usually when I work on a neural network I spend a lot of time trying not only different numbers of neurons but different types of nets too.
For example, I found an interesting article on your topic: link at Alberto Simões article. This is what they say:
Regarding the number of units in the hidden layers, there are some
rules of thumb: use the same number of units in all hidden layers, and
use at least the same number of units as the maximum between the
number of classes and the number of features. But there can be up to
three times that value. Given the high number of features we opted to
keep that same number of units in the hidden layer.
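Applied to the numbers in the question (400 features, 5 classes), the quoted rule of thumb works out as below. This is only the arithmetic of the quote, written as a small Python sketch; hidden_units_range is a hypothetical helper.

```python
def hidden_units_range(n_features, n_classes):
    """Range from the quoted rule of thumb: max(classes, features), up to 3x that."""
    lower = max(n_features, n_classes)
    return lower, 3 * lower

low, high = hidden_units_range(n_features=400, n_classes=5)
print(low, high)
```

By this rule the hidden layers would have between 400 and 1200 units each, far more than the 10 used in the question, with a correspondingly longer training time.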
Some advice from the comments:
The data split method (for train and test data sets) depends on your data. For example, I worked on industrial data and found that in the last part of the data set the technological parameters (the pressure for some equipment) had changed. So I had to include data for both operating modes in the training data set. But in your case I don't think there is the same problem... I recommend you try several random sets (just check that they are really different!).
To measure the net error I usually calculate the full vector of errors: I train the net and then check how it performs on all values to get the whole error vector. This gives useful views such as histograms, and I can see where my net goes wrong. It is not necessary, and is even harmful, to get sse (or mse) close to zero; usually that means you have already overtrained the net. As a first approximation I usually try to get 80-95% correct values on the training data set and then try the net on the test data set.
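A minimal sketch of the error-vector idea, in Python with made-up labels (the class numbers 0..4 standing in for the five dialects are an assumption for illustration):

```python
from collections import Counter

# Made-up true and predicted class labels (0..4 for five dialects).
true_labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
pred_labels = [0, 1, 2, 3, 4, 0, 2, 2, 3, 1]

# Per-sample error vector: 1 where the prediction is wrong, 0 where it is right.
errors = [int(t != p) for t, p in zip(true_labels, pred_labels)]
accuracy = 1 - sum(errors) / len(errors)

# "Histogram" of mistakes by true class: shows which classes the net gets wrong.
mistakes_by_class = Counter(t for t, p in zip(true_labels, pred_labels) if t != p)
print("accuracy:", accuracy)
print("mistakes by true class:", dict(mistakes_by_class))
```

Looking at the mistakes per class, rather than a single sse or mse number, shows whether the errors are spread evenly or concentrated in one or two classes.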
I am using xgb.train() in the xgboost R package to fit a classification model. I am trying to figure out the best iteration at which to stop adding trees. I set early_stop_rounds=6, and by watching each iteration's metrics I can clearly see that the AUC on the validation data reaches a maximum and then decreases. However, the model does not stop and keeps going until the specified nrounds is reached.
Question 1: Is the best model (for the given parameters) the one at the iteration where validation performance starts to decrease?
Question 2: Why does the model not stop when the AUC on the validation data starts to decrease?
Question 3: What does maximize=FALSE mean? What will make it stop if it is set to FALSE? Does it have to be FALSE when early_stop_rounds is set?
Question 4: How does the model know which element of the watchlist is the validation data? I've seen people use test=, eval=, validation1=, etc.
Thank you!
param<-list(
objective="binary:logistic",
booster="gbtree",
eta=0.02, #Control the learning rate
max.depth=3, #Maximum depth of the tree
subsample=0.8, #subsample ratio of the training instance
colsample_bytree=0.5 # subsample ratio of columns when constructing each tree
)
watchlist<-list(train=mtrain,validation=mtest)
sgb_model<-xgb.train(params=param, # this is the modeling parameter set above
data = mtrain,
scale_pos_weight=1,
max_delta_step=1,
missing=NA,
nthread=2,
nrounds = 500, #total number of boosting rounds
verbose=2,
early_stop_rounds=6, #if performance not improving for 6 rounds, model iteration stops
watchlist=watchlist,
maximize=FALSE,
eval.metric="auc" #Maximize AUC to evaluate model
#metric_name = 'validation-auc'
)
Answer 1: No, not the best, but good enough from a bias-variance tradeoff point of view.
Answer 2: It works, so maybe there is some problem with your code. Could you please share the progress output of the train and test AUCs at each boosting step to show this? If you are 100% sure it's not working, you can open an issue in the XGBoost Git project.
Answer 3: maximize=FALSE is for metrics where a smaller value is better, such as a custom error function (a custom merror type of thing, say). You always want to maximize AUC, so maximize=TRUE is better for you.
Answer 4: It's mostly position based. The training part goes first; the validation/evaluation part should come next.
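To make the role of the maximize flag concrete, here is a simplified early-stopping loop in Python. It is a sketch of the general technique, not XGBoost's actual implementation; best_iteration is a hypothetical helper and the AUC values are made up.

```python
def best_iteration(scores, patience, maximize):
    """Return (stop_round, best_round): training stops once the score has not
    improved for `patience` consecutive rounds. Rounds are 1-based."""
    best_score = None
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        improved = (
            best_score is None
            or (maximize and score > best_score)
            or (not maximize and score < best_score)
        )
        if improved:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return round_no, best_round
    return len(scores), best_round

# Made-up validation AUC per boosting round: rises, peaks at round 4, then falls.
val_auc = [0.70, 0.74, 0.77, 0.79, 0.78, 0.78, 0.77, 0.77, 0.76, 0.76, 0.75]

stop_max, best_max = best_iteration(val_auc, patience=6, maximize=True)
stop_min, best_min = best_iteration(val_auc, patience=6, maximize=False)
print("maximize=True :", stop_max, best_max)
print("maximize=False:", stop_min, best_min)
```

With maximize=True the loop keeps round 4 (the AUC peak) as the best round and stops six rounds later. With maximize=False it treats the lowest AUC as best, so it keeps round 1 and stops for the wrong reason.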
I'm having problems understanding how K-NN classification works in MATLAB.
Here's the problem: I have a large dataset (65 features for over 1500 subjects) and the respective class labels (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio for dividing the data into the 3 subgroups (1/3 of the dataset each)?
I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);
class=randi(2,1,ninds);
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
for know=1:kmax
%is it the same thing use knnclassify or fitcknn+predict??
predicted_class = knnclassify(val', train', train_class',know);
mdl = fitcknn(train',train_class','NumNeighbors',know) ;
label = predict(mdl,val');
consistency=sum(label==val_class')/length(val_class);
if consistency>precisionmax
precisionmax=consistency;
koptimal=know;
end
end
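% Refit with the k selected on the validation set, then evaluate once on the test set
mdl_final = fitcknn(train',train_class','NumNeighbors',koptimal) ;
label_final = predict(mdl_final,test');
consistency_final=sum(label_final==test_class')/length(test_class);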
Thank you very much for all your help
For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important. The more the better.
Thus, make it as big as possible and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to assign them the same amount
of data. But it is important to have enough data to be able to recognize over-adaptation. So, they
should be picked from the data basis fully randomly.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
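A 60/20/20 partitioning of the roughly 1500 subjects from the question could be sketched like this. It is Python rather than MATLAB; split_indices is a hypothetical helper and the seed is arbitrary.

```python
import random

def split_indices(n_samples, ratios, seed=0):
    """Randomly partition indices into train/validation/test parts by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * ratios[0])
    n_val = round(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1500, (0.6, 0.2, 0.2))
print(len(train_idx), len(val_idx), len(test_idx))
```

Shuffling first and then cutting the index list guarantees that the three parts are disjoint and drawn at random from the whole data set.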
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you stopped all efforts to improve accuracy at this point, you would indeed not need three partitions of your data. But typically you feel that you can improve the success of your model by, e.g., changing the number of weights or hidden layers or ..., and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you improved your topology so intelligently but because you unconsciously learn the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on truly unseen data: the test set. This is done only once and is also useful for revealing over-adaptation. You are not allowed to start a second, even bigger loop now; this is to prevent any adaptation to the test set!
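For the K-NN case in the question, the loop above reduces to choosing k on the validation set and then scoring once on the test set. The sketch below is plain Python with a tiny made-up 1-D data set (including one deliberately noisy training point); knn_predict and accuracy are hypothetical helpers, not MATLAB functions.

```python
from collections import Counter

def knn_predict(train_x, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, xs, ys, k):
    """Fraction of points in xs whose predicted class equals the label in ys."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Made-up, already-split data; the training point at 0.12 is a noisy label.
train_x = [[0.0], [0.12], [0.2], [0.4], [1.0], [1.2], [1.4]]
train_y = [0, 1, 0, 0, 1, 1, 1]
val_x, val_y = [[0.1], [0.3], [1.1], [1.3]], [0, 0, 1, 1]
test_x, test_y = [[0.25], [1.25]], [0, 1]

# Model-selection loop: score each candidate k on the validation set only.
best_k = max((1, 3, 5), key=lambda k: accuracy(train_x, train_y, val_x, val_y, k))

# Final step, done once: estimate accuracy on the untouched test set.
test_accuracy = accuracy(train_x, train_y, test_x, test_y, best_k)
print("best k:", best_k, "test accuracy:", test_accuracy)
```

The test set plays no part in choosing k, so the final number is not biased by the selection loop.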
I'm using a Gaussian Mixture Model to estimate the log-likelihood function (the parameters are estimated by the EM algorithm). I'm using MATLAB. My data is of size 17991402*1, i.e. 17991402 one-dimensional data points.
When I run gmdistribution.fit(X,2) I get the desired output.
But when I run gmdistribution.fit(X,k) for k>2, the code crashes with the error "OUT OF MEMORY". I have also tried an open-source implementation, which gives me the same problem. Can someone help me out here? I'm basically looking for code that will let me use different numbers of components on such a large dataset.
Thanks!!!
Is it possible for you to decrease the number of iterations? The default maximum is 100.
OPTIONS = statset('MaxIter',50,'Display','final','TolFun',1e-6)
gmdistribution.fit(X,3,OPTIONS)
Or you may consider under-sampling the original data.
A general solution to the out-of-memory problem is described in this document.
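The under-sampling suggestion amounts to fitting the mixture on a uniform random subset of the points. A minimal sketch in Python (not MATLAB), with a made-up data vector standing in for the real one; undersample is a hypothetical helper and the 1% fraction is an arbitrary choice:

```python
import random

def undersample(data, fraction, seed=0):
    """Return a uniform random subset containing roughly `fraction` of the data."""
    n_keep = max(1, round(len(data) * fraction))
    return random.Random(seed).sample(data, n_keep)

# Made-up stand-in for the 17,991,402-point vector from the question.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

subset = undersample(data, 0.01)
print(len(data), "->", len(subset))
```

Whether the subset is large enough is easiest to judge by fitting on a few subsets of different sizes and checking that the estimated parameters stay stable.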
I'm using Mahout 0.7 to do some classification. I have an encoder for a continuous variable:
ContinuousValueEncoder durationPlanEncoder = new ContinuousValueEncoder("duration_plan");
The feature associated with this encoder is a number of days and can range from about 6 to 16.
I'm using an OnlineLogisticRegression model and I use the encoder to train it:
durationPlanEncoder.addToVector(null, <duration_plan double val>, trainDataVector);
For simplicity (since I'm trying to understand this whole classification thing while also learning Mahout), I am using 2 variables: 1) a categorical variable with 6 categories, one of which ("dev") always predicts the =1 category; and 2) this "duration_plan" variable.
What I expect to find is that, when I give the classifier test data that consists of the category "dev" and a "duration_plan" value, the accuracy of the classifier will increase as the "duration_plan" value I give it gets closer to its average value across the training data. This is not what I'm seeing, however. Instead, the accuracy of the classifier improves as the value of "duration_plan" goes to 0.0. However, there are no training vectors with duration_plan=0.0! Why would this be the case?
Then I modified my durationPlanEncoder as follows:
durationPlanEncoder.setProbes(2);
and the accuracy improved. It got even better when I made the number of probes 20, then 200. Why? What is setProbes() doing, and is this an anomaly or is this actually how I should be doing it?
The final part of my question is that, even after calling setProbes(20), changing the value of "duration_plan" in the test data has no effect on the accuracy of the classifier, which I don't think is how it should be. If I give a value for duration_plan that doesn't exist in any of the training data and thus is never correlated with the =1 class, I would expect the classifier to classify the test sample as =0. Right? This makes me think I must be coding something just plain wrong. Any suggestions are appreciated.
Mahout documentation is woefully sparse.
thanks.