Is it necessary to initialized the weights for every time retraining the same model in matlab with nntool? - matlab

I know for the ANN model, the initial weights are random. If I train a model and repeat training 10 times by nntool, do the weights initialize every time when I click the training button, or still use the same initial weights you just adjusted?

I am not sure if the nntool you refer to uses the train method (see here https://de.mathworks.com/help/nnet/ref/train.html).
I have used this method quite extensively and it works in a similar way as tensorflow, you store a number of checkpoints and load the latest status to continue training from such point. The code would look something like this.
[feat,target] = iris_dataset;
my_nn = patternnet(20);
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint','CheckpointDelay',30);
Here we have requested that checkpoints are stored at a rate not greater than one each 30 seconds. When you want to continue training the net must be loaded from the checkpoint file as:
[feat,target] = iris_dataset;
load MyCheckpoint
my_nn = checkpoint.my_nn;
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint');
This solution involves training the network from the command line or via a script rather than using the GUI supplied by Mathworks. I honestly think this latter method is great for beginners but if you want to do any interesting clever start using the command line or even better switch to libraries like Torch or Tensorflow!
Hope it helps!

Related

Use the result values from previous simulation result as guess values for next simulation in Dymola

Initialization could be very cumbersome and easily lead to divergence. A simple strategy is to run the simulation when building a part of the whole system and use the simulation results to modify guess values.
Here is what I got in the PPT from Francesco Casella and the book from Daniel Bouskela.
I found that I could use an option in Dymola as follows, but instead of using the initialization result, I wanna use the result when reaching a steady state. So I'd like to use a python script to extract the result from the .mat result file, then modify the iteration variables automatically. But the key problem is that I don't know when I add more components in my model, the iteration variable set of existing components would change, I don't know what kind of effect would this causes.
Anyone got opinion on this issue, welcome to answer this question.
So my question is where should I find the python
You can use the end values (= steady state) of the simulation result in order to create a new initialization (Dymola Manual 1, section 2.5.12) . If the component names are the same in the sub system model and the total model, you can run the script created in the subsystem model on the larger system model as well. But you have to check if your models have initial equations that hinder an initialization from the outside (see section 4.2 in https://2012.international.conference.modelica.org/proceedings/html/pdf/ecp12076927_KruegerMehlhaseSchmitz.pdf)
It should also be possible to initialize it steady state. Instead of providing initial values for a state x and fixing it, you can provide initial equations for the derivatives such as der(x) = 0;
With that setup activate Save Initial Results and you should be good to go.

Impact of using data shuffling in Pytorch dataloader

I implemented an image classification network to classify a dataset of 100 classes by using Alexnet as a pretrained model and changing the final output layers.
I noticed when I was loading my data like
trainloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=False)
, I was getting accuracy on validation dataset around 2-3 % for around 10 epochs but when I just changed shuffle=True and retrained the network, the accuracy jumped to 70% in the first epoch itself.
I was wondering if it happened because in the first case the network was being shown one example after the other continuously for just one class for few instances resulting in network making poor generalizations during training or is there some other reason behind it?
But, I did not expect that to have such a drastic impact.
P.S: All the code and parameters were exactly the same for both the cases except changing the shuffle option.
Yes it totally can affect the result! Shuffling the order of the data that we use to fit the classifier is so important, as the batches between epochs do not look alike.
Checking the Data Loader Documentation it says:
"shuffle (bool, optional) – set to True to have the data reshuffled at every epoch"
In any case, it will make the model more robust and avoid over/underfitting.
In your case this heavy increase of accuracy (from the lack of awareness of the dataset) probably is due to how the dataset is "organised" as maybe, as an example, each category goes to a different batch, and in every epoch, a batch contains the same category, which derives to a very bad accuracy when you are testing.
PyTorch did many great things, and one of them is the DataLoader class.
DataLoader class takes the dataset (data), sets the batch_size (which is how many samples per batch to load), and invokes the sampler from a list of classes:
DistributedSampler
SequentialSampler
RandomSampler
SubsetRandomSampler
WeightedRandomSampler
BatchSampler
The key thing samplers do is how they implement the iter() method.
In case of SequentionalSampler it looks like this:
def __iter__(self):
return iter(range(len(self.data_source)))
This returns an iterator, for every item in the data_source.
When you set shuffle=True that would not use SequentionalSampler, but instead the RandomSampler.
And this may improve the learning process.

Neural network gets only 50% good prediction on test data

I made a neural network whice i want to classify the input data (400 caracteristics per input data) as one of the five arabic dialects. I divede the trainig data in "train data", "validation data" and than "test date", with net.divideFcn = 'dividerand'; . I use trainbr as training function, whice results in a long training, that's because i have 9000 elements in training data.
For the network arhitecture i used two-layers, first with 10 perceptrons, second with 5, 5 because i use one vs all strategy.
The network training ends usually with minimum gradient reached, rather than minimux error.
How can i make the network predict better? Could it be o problem with generalization (the network learn very well the training data, but test on new data tends to fail?
Should i add more perceptrons to the first layer? I'm asking that because i take about a hour to train the network when i have 10 perceptrons on first layer, so the time will increase.
This is the code for my network:
[Test] = load('testData.mat');
[Ex] = load('trainData.mat');
Ex.trainVectors = Ex.trainVectors';
Ex.trainLabels = Ex.trainLabels';
net = newff(minmax(Ex.trainVectors),[10 5] ,{'logsig','logsig'},'trainlm','learngdm','sse');
net.performFcn = 'mse';
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.95;
net.trainParam.epochs = 1000;
net.trainParam.goal = 0;
net.trainParam.max_fail = 50;
net.trainFcn = 'trainbr';
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
net = init(net);
net = train(net,Ex.trainVectors,Ex.trainLabels);
Thanks !
Working with neural networks is some type of creative work. So noone can't give you the only true answer. But I can give some advices based on my own experience.
First of all - check the network error when training ends (on training and validation data sets. Before you start to use test data set). You told it is minimum but what is its actual value? If it 50% too, so we have bad data or wrong net architecture.
If error for train data set is OK. Next step - lets check how much the coefficients of your net are changing at the validation step. And what's up about the error here. If they changed dramatically that's the sigh our architecture is wrong: Network does not have the ability to generalize and will retrain at every new data sets.
What else can we do before changing architecture? We can change the number of epochs. Sometimes we can get good results but it is some type of random - we must be sure the changing of coefficient is small at the ending steps of training. But as I remember nntool check it automatically, so maybe we can skip this step.
One more thing I want to recommend to you - change train data set. Maybe you know rand is give you always the same number at start of matlab, so if you create your data sets only once you can work with the same sets always. This problem is also about non-homogeneous data. It can be that some part of your data is more important than other. So if some different random sets will give about the same error data is ok and we can go further. If not - we need to work with data and split it more carefully. Sometimes I avoid using dividerand and divide data manually.
Sometimes I tried to change the type of activation function. But here you use perceptron... So the idea - try to use sigma- or linear- neurons instead of perceptrons. This rarely leads to significant improvements but can help.
If all this steps can't give you enough you have to change net architecture. And the number of neurons in the first layer is the first you have to do. Usually when I work on the neural network I spend a lot of time trying not only different number of neurons but the different types of nets too.
For example, I found interesting article about your topic: link at Alberto Simões article. And that's what they say:
Regarding the number of units in the hidden layers, there are some
rules of thumb: use the same number of units in all hidden layers, and
use at least the same number of units as the maximum between the
number of classes and the number of features. But there can be up to
three times that value. Given the high number of features we opted to
keep that same number of units in the hidden layer.
Some advices from comments:
Data split method (for train and test data sets) depends on your data. For example, I worked on industry data and found that at the last part of the data set technological parameters (pressure for some equipment) was changed. So I have to get data for both operation modes to train data set. But for your case I don't thing there are the same problem... I recommend you to try several random sets (just check they are really different!).
For measuring net error I usually calculate full vector of errors - I train net and then check it's work for all values to get the whole errors vector. It's useful to get some useful vies like histograms and etc and I can see where my net is go wrong. It is not necessary and even harmful to get sse (or mse) close to zero - usually that's mean you already overtrain the net. For the first approximation I usually try to get 80-95% of correct values on training data set and then try the net on the test data set.

xgboost R package early_stop_rounds does not trigger

I am using xgb.train() in xgboost R package to fit a classification model. I am trying to figure out what's the best iteration to stop the tree. I set the early_stop_rounds=6 and by watching each iteration's metrics I can clearly see that the auc performance on validation data reach the max and then decrease. However the model does not stop and keep going until the specified nround is reached.
Question 1: Is it the best model (for given parameter) defined at iteration when validation performance start to decrease?
Question 2: Why does the model does not stop when auc on validation start to decrease?
Question 3: How does Maximize parameter=FALSE mean? What will make it stop if it is set to FALSE? does it have to be FALSE when early_stop_round is set?
Question 4: How does the model know which one is the validation data in the watchlist? I've seen people use test=,eval=, validation1= etc?
Thank you!
param<-list(
objective="binary:logistic",
booster="gbtree",
eta=0.02, #Control the learning rate
max.depth=3, #Maximum depth of the tree
subsample=0.8, #subsample ratio of the training instance
colsample_bytree=0.5 # subsample ratio of columns when constructing each tree
)
watchlist<-list(train=mtrain,validation=mtest)
sgb_model<-xgb.train(params=param, # this is the modeling parameter set above
data = mtrain,
scale_pos_weight=1,
max_delta_step=1,
missing=NA,
nthread=2,
nrounds = 500, #total run 1500 rounds
verbose=2,
early_stop_rounds=6, #if performance not improving for 6 rounds, model iteration stops
watchlist=watchlist,
maximize=FALSE,
eval.metric="auc" #Maximize AUC to evaluate model
#metric_name = 'validation-auc'
)
Answer 1: No, not the best but good enough from bias-variance
tradeoff point of view.
Answer 2: It works, may be there is some problem with your code. Would you please give share the progress output of train and test set AUCs at each boosting step to prove this? If you are 100% sure its not working then you can submit error ticket in XGBoost git project.
Answer 3: Maximize=FALSE is for custom optimization function (say custom merror type of thing). You always want to maximize/increase AUC so Maximize=TRUE is better for you.
Answer 4: Its mostly position based. Train part first. Next should go into validation/evaluation.

What is the meaning of "drop" and "sgd" while training custom ner model using spacy?

I am training a custom ner model to identify organization name in addresses.
My training loop looks like this:-
for itn in range(100):
random.shuffle(TRAIN_DATA)
losses = {}
batches = minibatch(TRAIN_DATA, size=compounding(15., 32., 1.001))
for batch in batches
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer,
drop=0.25, losses=losses)
print('Losses', losses)
Can someone explain the parameters "drop", "sgd", "size" and give some ideas to how should I change these values, so that my model performs better.
You can find details and tips in the spaCy documentation:
https://spacy.io/usage/training#tips-batch-size:
The trick of increasing the batch size is starting to become quite popular ... In training the various spaCy models, we haven’t found much advantage from decaying the learning rate – but starting with a low batch size has definitely helped
batch_size = compounding(1, max_batch_size, 1.001)
This will set the batch size to start at 1, and increase each batch until it reaches a maximum size.
https://spacy.io/usage/training#tips-dropout:
For small datasets, it’s useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:
dropout = decaying(0.6, 0.2, 1e-4)
https://spacy.io/usage/training#annotations:
sgd: An optimizer, i.e. a callable to update the model’s weights. If not set, spaCy will create a new one and save it for further use.
The drop, sgd and size are some of the parameters you can customize to optimize your training.
drop is used to change the value of dropout.
size is used to change the size of the batch
sgd is used to change various hyperparameters such as learning rate, Adam beta1 and beta2 parameters, gradient clipping and L2 regularisation.
I consider the sgd to be a very important argument to experiment with.
To help you, I wrote a short blog post showing how to customize any spaCy parameters from your python interpreter (e.g. jupyter notebook). No command line interface required.