During cross-validation on the training data, using batchnorm significantly improves performance. But (after retraining on the entire training set) the presence of the batchnorm layer completely destroys the model's generalization to a holdout set. This is a little surprising, and I'm wondering if I'm implementing the test predictions incorrectly.
Generalization w/o the batchnorm layer present is fine (not high enough for my project's goals, but reasonable for such a simple net).
I cannot share my data, but does anyone see an obvious implementation error? Is there a flag that should be set to test mode? I can't find an answer in the docs, and dropout (which also should have different train/test behavior) works as expected. Thanks!
code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.callbacks import EarlyStopping
from sklearn.metrics import roc_curve
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
from keras.callbacks import ModelCheckpoint
filepath="L1_batch1_weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
init = 'he_normal'
act = 'relu'
neurons1 = 80
dropout_rate = 0.5
model = Sequential()
model.add(Dropout(0.2, input_shape=(5000,)))
model.add(Dense(neurons1))
model.add(BatchNormalization())
model.add(Activation(act))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=["accuracy"])
my_model = model.fit(X_train, y_train, batch_size=128, nb_epoch=150, validation_data=(X_test, y_test), callbacks=[early_stopping, checkpoint])
model.load_weights("L1_batch1_weights.best.hdf5")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")
probs = model.predict_proba(X_test, batch_size=2925)
fpr, tpr, thresholds = roc_curve(y_test, probs)
From the docs: "During training we use per-batch statistics to normalize the data, and during testing we use running averages computed during the training phase."
In my case, the training batch size was 128. At test time, I had manually set the batch size to the size of the complete test set (2925).
It makes sense that batch statistics gathered at one batch size will not necessarily be relevant at a significantly different batch size.
Changing the test batch size to match the training batch size (128) produced more stable results. I experimented with prediction batch sizes to observe the effect: prediction results were stable for any batch size within about 3x of the training batch size (larger or smaller); beyond that, performance deteriorated.
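In other words, the minimal fix on my end was just to predict with a batch size close to the training one (same model and X_test as above; 128 simply matches my training batch size):
probs = model.predict_proba(X_test, batch_size=128)  # train-sized batches instead of one 2925-sample batch
fpr, tpr, thresholds = roc_curve(y_test, probs)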
There is some discussion of the impact of test batch size along with using batchnorm when used with load_weights() here:
https://github.com/fchollet/keras/issues/3423
Related
I am trying to run an action recognition code from GitHub. The original code used a batch size of 128 with 4 GPUs. I only have two GPUs, so I cannot match their batch size. Is there any way I can compensate for this difference in batch size? I saw somewhere that iter_size might compensate, according to the formula effective_batchsize = batch_size * iter_size * n_gpu. What is iter_size in this formula?
I am using PyTorch, not Caffe.
In PyTorch, when you perform the backward step (calling loss.backward() or similar) the gradients are accumulated in place. This means that if you call loss.backward() multiple times, the previously calculated gradients are not replaced; instead, the new gradients get added to the previous ones. That is why, when using PyTorch, it is usually necessary to explicitly zero the gradients between minibatches (by calling optimiser.zero_grad() or similar).
If your batch size is limited, you can simulate a larger batch size by breaking a large batch up into smaller pieces, and only calling optimiser.step() to update the model parameters after all the pieces have been processed.
For example, suppose you are only able to do batches of size 64, but you wish to simulate a batch size of 128. If the original training loop looks like:
optimiser.zero_grad()
loss = model(batch_data) # batch_data is a batch of size 128
loss.backward()
optimiser.step()
then you could change this to:
optimiser.zero_grad()
smaller_batches = batch_data[:64], batch_data[64:128]
for batch in smaller_batches:
    loss = model(batch) / 2  # rescale the loss so the accumulated update matches a single batch of 128
    loss.backward()
optimiser.step()
and the updates to the model parameters would be the same in each case (apart maybe from some small numerical error). Note that you have to rescale the loss to make the update the same.
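To make the accumulation pattern concrete, here is a self-contained sketch of the same idea; the tiny linear model, loss function and random data are made up purely so the snippet runs, and only the accumulate-then-step structure matters:
import torch
from torch import nn

# Made-up model and data, just so the sketch is runnable
model = nn.Linear(10, 1)
criterion = nn.BCEWithLogitsLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
batch_data = torch.randn(128, 10)                      # the "large" batch we want to simulate
batch_targets = torch.randint(0, 2, (128, 1)).float()

accum_steps = 2                                        # two pieces of 64 = effective batch of 128
optimiser.zero_grad()
for i in range(accum_steps):
    piece = slice(i * 64, (i + 1) * 64)
    loss = criterion(model(batch_data[piece]), batch_targets[piece]) / accum_steps  # rescale the loss
    loss.backward()                                    # gradients accumulate across the pieces
optimiser.step()                                       # one update, equivalent to a single 128-sample step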
The important factor is not so much the batch size; it's the number of epochs you train for. Can you double the per-GPU batch size, giving you the same overall (cluster) batch size? If so, that will compensate directly for the problem. If not, double the number of iterations, so you're training for the same number of epochs. The model will quickly overcome the effects of the early-batch bias.
However, if you are comfortable digging into the training code, myrtlecat gave you an answer that will eliminate the batch-size difference quite nicely.
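For reference, plugging the numbers from the question into the effective_batchsize = batch_size * iter_size * n_gpu formula (assuming the 128 in the original repository is the per-GPU batch size, which the question leaves ambiguous):
# iter_size = number of mini-batches whose gradients are accumulated before each optimiser step
original_effective = 128 * 1 * 4   # original setup: batch 128 per GPU, no accumulation, 4 GPUs -> 512
two_gpu_effective = 128 * 2 * 2    # the same 512 on 2 GPUs by accumulating 2 mini-batches (iter_size = 2)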
I made a neural network which I want to use to classify input data (400 features per input sample) as one of five Arabic dialects. I divide the training data into "train data", "validation data" and then "test data" with net.divideFcn = 'dividerand';. I use trainbr as the training function, which results in long training because I have 9000 elements in the training data.
For the network architecture I used two layers, the first with 10 perceptrons and the second with 5 (5 because I use a one-vs-all strategy).
The network training usually ends with the minimum gradient reached, rather than the minimum error.
How can I make the network predict better? Could it be a problem with generalization (the network learns the training data very well, but tests on new data tend to fail)?
Should I add more perceptrons to the first layer? I ask because it already takes about an hour to train the network with 10 perceptrons in the first layer, so the time will only increase.
This is the code for my network:
[Test] = load('testData.mat');
[Ex] = load('trainData.mat');
Ex.trainVectors = Ex.trainVectors';
Ex.trainLabels = Ex.trainLabels';
net = newff(minmax(Ex.trainVectors),[10 5] ,{'logsig','logsig'},'trainlm','learngdm','sse');
net.performFcn = 'mse';
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.95;
net.trainParam.epochs = 1000;
net.trainParam.goal = 0;
net.trainParam.max_fail = 50;
net.trainFcn = 'trainbr';
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
net = init(net);
net = train(net,Ex.trainVectors,Ex.trainLabels);
Thanks!
Working with neural networks is a kind of creative work, so no one can give you the one true answer. But I can give some advice based on my own experience.
First of all, check the network error when training ends (on the training and validation data sets, before you start using the test data set). You said it reaches a minimum, but what is its actual value? If it is around 50% as well, then we have bad data or the wrong net architecture.
If the error on the training data set is OK, the next step is to check how much the coefficients of your net change at the validation step, and what happens to the error there. If they change dramatically, that is a sign our architecture is wrong: the network does not have the ability to generalize and would have to be retrained for every new data set.
What else can we do before changing the architecture? We can change the number of epochs. Sometimes we get good results this way, but it is somewhat random; we must make sure the change in coefficients is small during the final steps of training. As far as I remember, nntool checks this automatically, so maybe we can skip this step.
One more thing I want to recommend: change the training data set. As you may know, rand always gives you the same numbers at the start of MATLAB, so if you create your data sets only once you will always be working with the same sets. This is also an issue with non-homogeneous data: it may be that some part of your data is more important than the rest. So if several different random splits give about the same error, the data is OK and we can go further. If not, we need to work with the data and split it more carefully. Sometimes I avoid dividerand and divide the data manually.
Sometimes I also try changing the type of activation function. But here you use perceptrons... So the idea is to try sigmoid or linear neurons instead of perceptrons. This rarely leads to significant improvements, but it can help.
If all these steps don't give you enough, you have to change the net architecture, and the number of neurons in the first layer is the first thing to change. Usually when I work on a neural network I spend a lot of time trying not only different numbers of neurons but different types of nets too.
For example, I found an interesting article about your topic: the linked article by Alberto Simões. This is what they say:
Regarding the number of units in the hidden layers, there are some rules of thumb: use the same number of units in all hidden layers, and use at least the same number of units as the maximum between the number of classes and the number of features. But there can be up to three times that value. Given the high number of features we opted to keep that same number of units in the hidden layer.
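Applied to the numbers in your question (400 features, 5 dialects), that rule of thumb would suggest far more than the 10 neurons currently in your first layer; a quick check of what it implies (plain arithmetic, variable names are only for illustration):
n_features, n_classes = 400, 5                   # from your question
min_hidden_units = max(n_features, n_classes)    # at least 400 units by this rule of thumb
max_hidden_units = 3 * min_hidden_units          # up to roughly 1200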
Some advice from the comments:
The data split method (for the train and test data sets) depends on your data. For example, I worked with industrial data and found that in the last part of the data set a technological parameter (the pressure of some equipment) changed, so I had to include data from both operating modes in the training set. But in your case I don't think there is a similar problem... I recommend you try several different random splits (just check that they really are different!).
For measuring net error I usually compute the full vector of errors: I train the net and then check its output on all values to get the whole error vector. It is useful to build views such as histograms, etc., so I can see where my net goes wrong. It is not necessary, and even harmful, to get the SSE (or MSE) close to zero; usually that means you have already overtrained the net. As a first approximation I usually try to get 80-95% of values correct on the training data set and then try the net on the test data set.
I am trying to train a feedforward neural network for binary classification. My dataset has 6.2M samples with 1.5M dimensions. I am using PyBrain. I am unable to load even a single data point; I get a MemoryError.
My Code snippet is:
Train_ds = SupervisedDataSet(FV_length, 1)  # FV_length is a computed value: 150000
feature_vector = numpy.zeros((FV_length), dtype=numpy.int)
# activate feature values
for index in nonzero_index_list:
    feature_vector[index] = 1
Train_ds.addSample(feature_vector, class_label)  # both the arguments are tuples
It looks like your computer just does not have the memory to add your feature and class label arrays to the supervised data set Train_ds.
If there is no way for you to allocate more memory to your system, it might be a good idea to randomly sample from your data set and train on the smaller sample.
This should still give accurate results assuming the sample is large enough to be representative.
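A minimal sketch of that idea, assuming the raw data is available as a list of (nonzero_index_list, class_label) pairs matching the snippet above; the list here is randomly generated just so the sketch runs, and the sample fraction is an arbitrary placeholder:
import random
import numpy
from pybrain.datasets import SupervisedDataSet

FV_length = 150000
raw_examples = [(numpy.random.choice(FV_length, 20, replace=False), (1,))
                for _ in range(1000)]                     # fake data for illustration only

sample_fraction = 0.01                                    # keep 1% of the examples; tune to what fits in RAM
sampled = random.sample(raw_examples, int(len(raw_examples) * sample_fraction))

Train_ds = SupervisedDataSet(FV_length, 1)
for nonzero_index_list, class_label in sampled:
    feature_vector = numpy.zeros(FV_length, dtype=numpy.int8)  # int8 uses 8x less memory than the default int64
    feature_vector[nonzero_index_list] = 1
    Train_ds.addSample(feature_vector, class_label)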
I am trying to train a neural network on a big training set.
The inputs consist of approximately 4 million columns and 128 rows, and the targets consist of 62 rows.
hiddenLayerSize is 128.
The script is as follows:
net = patternnet(hiddenLayerSize);
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
net.trainFcn = 'trainbfg';
net.performFcn = 'mse'; % Mean squared error
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
'plotregression', 'plotfit'};
net.trainParam.show = 1;
net.trainParam.showCommandLine = 1;
[net,tr] = train(net,inputs,targets, 'showResources', 'yes', 'reduction', 10);
When train starts to execute, MATLAB hangs, Windows hangs or becomes slow, swapping hits the disk heavily, and nothing else happens for dozens of minutes.
The computer has 12 GB of RAM and runs 64-bit Windows; MATLAB is also 64-bit. Memory usage in the process manager varies during the operation.
What else can be done besides reducing the training set?
If I reduce the training set, to what size? How can I estimate that size other than by trying?
Why doesn't the function display anything?
It is fairly hard to diagnose such problems remotely, to the point that I am not even sure that anything anyone can answer might actually help. Moreover, you are asking several questions in one, so I will take it step by step. Ultimately I will try to give you a better understanding of the memory consumption of your script.
Memory consumption
Dataset Size and Copies
Starting from the size of the dataset you are loading in memory, and assuming that each entry contains a double-precision floating-point number, your training data set requires (4e6 * 128 * 8) bytes of memory, which roughly resolves to 3.81 GB. If I understand correctly, your array of outputs contains (4e6 * 62) entries, which become (4e6 * 62 * 8) bytes, roughly equivalent to 1.85 GB. So even before running the network training you are consuming circa 5.7 GB of memory.
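A quick back-of-the-envelope check of those figures (plain arithmetic, written here as Python only for convenience):
bytes_per_double = 8
inputs_bytes = 4e6 * 128 * bytes_per_double    # ~4.10e9 bytes
targets_bytes = 4e6 * 62 * bytes_per_double    # ~1.98e9 bytes
print(inputs_bytes / 2**30, targets_bytes / 2**30)   # ~3.81 GB and ~1.85 GB
print((inputs_bytes + targets_bytes) / 2**30)        # ~5.7 GB before training even starts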
Now, yes, MATLAB uses lazy copying, so any assignment:
training = zeros(4e6, 128);
copy1 = training;
copy2 = training;
will not require new memory. However, any slicing operation:
training = zeros(4e6, 128);
part1 = training(1:1000, :);
part2 = training(1001:2000, :);
will indeed allocate more memory. Hence when selecting your training, validation and testing subsets:
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
internally the train() function could potentially be re-allocating the same amount of memory twice. Your grand total would now be around 11 GB. If you now consider that your operating system is running, along with a bunch of other applications, it is easy to understand why everything suddenly slows down. I might be telling you something obvious here, but: your dataset is very large.
Profiling Helps
Now, whilst I am pretty sure about my ~5.7 GB consumption calculation, I am not sure if this is a valid assumption. The bottom line is that I don't know the inner workings of the train() function that well.
This is why I urge you to test it out with MATLAB's very own profiler. This will indeed give you a much better understanding of function calls and memory consumption.
Reducing Memory Usage
What can be done to reduce memory consumption? Now this is probably the question that has been haunting programmers since the dawn of time. :) Once again, it is hard to provide a unique answer, as the solution is often dependent on the task, problem and tools at hand. MATLAB has a, let's give it the benefit of the doubt, informative page on how to reduce memory usage. Very often, though, the problem lies in the size of the data to be loaded in memory.
I, for one, would of course start by reducing the size of your dataset. Do you really need 4e6 * 128 data points? If you do, then you might consider investing in dedicated solutions such as high-performance servers to perform your computation. If not, then you, and only you, can look at your dataset and start analysing which features might be unnecessary, to cut down the columns, and, most importantly, which samples might be unnecessary, to cut down the rows.
Being optimistic
On a side note, you did not complain about any OutOfMemory errors from MATLAB, which could be a good sign. Maybe your machine is simply hanging because the computation is THAT intensive. And this too is a reasonable assumption, as you are creating a network with a 128-neuron hidden layer and 62 outputs, and running several epochs of training, as you should be doing.
Kill The JVM
What you can do to put less load on the machine is to run MATLAB without the Java Virtual Machine (JVM). This ensures that MATLAB itself will require less memory to run. The JVM can be disabled by running:
matlab -nojvm
This works if you do not need to display any graphics, as MATLAB will run in a console-like environment.
I am training a custom NER model to identify organization names in addresses.
My training loop looks like this:
for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(15., 32., 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=0.25, losses=losses)
    print('Losses', losses)
Can someone explain the parameters "drop", "sgd" and "size", and give some ideas about how I should change these values so that my model performs better?
You can find details and tips in the spaCy documentation:
https://spacy.io/usage/training#tips-batch-size:
The trick of increasing the batch size is starting to become quite popular ... In training the various spaCy models, we haven’t found much advantage from decaying the learning rate – but starting with a low batch size has definitely helped
batch_size = compounding(1, max_batch_size, 1.001)
This will set the batch size to start at 1, and increase each batch until it reaches a maximum size.
https://spacy.io/usage/training#tips-dropout:
For small datasets, it’s useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:
dropout = decaying(0.6, 0.2, 1e-4)
https://spacy.io/usage/training#annotations:
sgd: An optimizer, i.e. a callable to update the model’s weights. If not set, spaCy will create a new one and save it for further use.
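Putting those pieces together, here is a hedged sketch of how the loop from the question could be adapted with a compounding batch size and a decaying dropout rate (assuming the spaCy v2 API, with nlp and TRAIN_DATA set up as in the question; the start/stop values are simply the ones quoted above, not tuned recommendations):
import random
from spacy.util import minibatch, compounding, decaying

optimizer = nlp.begin_training()        # spaCy creates the optimizer ("sgd") and reuses it
dropout = decaying(0.6, 0.2, 1e-4)      # start at 0.6 and decay towards 0.2

for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(1., 32., 1.001))  # batch size grows from 1 to 32
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=next(dropout), losses=losses)  # take the next value from the dropout schedule
    print('Losses', losses)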
The drop, sgd and size are some of the parameters you can customize to optimize your training.
drop is used to change the value of dropout.
size is used to change the size of the batch.
sgd is used to change various hyperparameters such as learning rate, Adam beta1 and beta2 parameters, gradient clipping and L2 regularisation.
I consider the sgd to be a very important argument to experiment with.
To help you, I wrote a short blog post showing how to customize any spaCy parameter from your Python interpreter (e.g. a Jupyter notebook). No command-line interface required.