How to train neural networks on big sample sets in MATLAB?

I am trying to train a neural network on a big training set.
The inputs consist of approximately 4 million columns and 128 rows, and the targets consist of 62 rows.
hiddenLayerSize is 128.
The script is as follows:
net = patternnet(hiddenLayerSize);
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
net.trainFcn = 'trainbfg';
net.performFcn = 'mse'; % Mean squared error
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
    'plotregression','plotfit'};
net.trainParam.show = 1;
net.trainParam.showCommandLine = 1;
[net,tr] = train(net,inputs,targets, 'showResources', 'yes', 'reduction', 10);
When train starts to execute, MATLAB hangs, Windows hangs or becomes slow, the disk thrashes with swapping, and nothing else happens for dozens of minutes.
The computer has 12 GB of RAM and runs 64-bit Windows; MATLAB is also 64-bit. Memory usage in the process manager varies during the operation.
What else can be done except reducing the training set?
If I reduce the training set, to what level? How can I estimate the right size other than by trial and error?
Why doesn't the function display anything?

It is fairly hard to diagnose such problems remotely, to the point that I am not even sure any answer will actually help. Moreover, you are asking several questions in one, so I will take it step by step. Ultimately I will try to give you a better understanding of the memory consumption of your script.
Memory consumption
Dataset Size and Copies
Starting from the size of the dataset you are loading into memory, and assuming each entry contains a double-precision floating-point number, your training data set requires (4e6 * 128 * 8) bytes of memory, which resolves to roughly 3.81 GB. If I understand correctly, your array of targets contains (4e6 * 62) entries, which become (4e6 * 62 * 8) bytes, roughly equivalent to 1.85 GB. So even before running the network training you are consuming circa 5.7 GB of memory.
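If you want to double-check these figures, here is the back-of-the-envelope computation in MATLAB (a sketch; dividing by 2^30 gives binary gigabytes):
bytesPerDouble = 8;
inputsGB  = 4e6 * 128 * bytesPerDouble / 2^30;   % ~3.81 GB
targetsGB = 4e6 *  62 * bytesPerDouble / 2^30;   % ~1.85 GB
fprintf('inputs: %.2f GB, targets: %.2f GB, total: %.2f GB\n', ...
    inputsGB, targetsGB, inputsGB + targetsGB);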
Now, yes, MATLAB uses lazy copy (copy-on-write), so any plain assignment:
training = zeros(4e6, 128);
copy1 = training;
copy2 = training;
will not require new memory. However, any slicing operation:
training = zeros(4e6, 128);
part1 = training(1:1000, :);
part2 = training(1001:2000, :);
will indeed allocate more memory. Hence when selecting your training, validation and testing subsets:
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
internally the train() function could potentially re-allocate the same amount of memory again, bringing your grand total to over 11 GB. If you now consider that your operating system is running too, along with a bunch of other applications, it is easy to understand why everything suddenly slows down. I might be telling you something obvious here, but: your dataset is very large.
Profiling Helps
Now, whilst I am pretty sure about my 5.7 GB consumption calculation, I am not sure whether this assumption holds for what happens inside train(). The bottom line is that I don't know the inner workings of the train() function that well.
This is why I urge you to test it out with MATLAB's very own profiler. This will indeed give you a much better understanding of function calls and memory consumption.
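As a minimal sketch of how you might wrap the training in the profiler (the -memory option also appears in another question on this page):
profile -memory on
[net,tr] = train(net, inputs, targets);
profile off
profview   % inspect allocated/freed memory per function call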
Reducing Memory Usage
What can be done to reduce memory consumption? Now this is probably the question that has haunted programmers since the dawn of time. :) Once again, it is hard to provide a unique answer, as the solution often depends on the task, problem and tools at hand. MATLAB has a, let's give it the benefit of the doubt, informative page on how to reduce memory usage. Very often, though, the problem lies in the size of the data to be loaded into memory.
I, for one, would of course start by reducing the size of your dataset. Do you really need 4e6 * 128 data points? If you do, then you might consider investing in dedicated solutions such as high-performance servers to perform your computation. If not, then you, and only you, can look at your dataset and start analysing which features might be unnecessary, to cut down the rows, and, most importantly, which samples might be unnecessary, to cut down the columns (in your layout, features are rows and samples are columns).
Being optimistic
On a side note, you did not complain about any OutOfMemory errors from MATLAB, which could be a good sign. Maybe your machine is simply hanging because the computation is THAT intensive. And this too is a reasonable assumption, as you are creating a network with a 128-neuron hidden layer and 62 outputs, and running several epochs of training, as you should be doing.
Kill The JVM
What you can do to put less load on the machine is to run MATLAB without the Java Virtual Machine (JVM). This ensures that MATLAB itself requires less memory to run. The JVM can be disabled by starting MATLAB with:
matlab -nojvm
This works if you do not need to display any graphics, as MATLAB will run in a console-like environment.

Related

matlab parfor is very slow with operation on a large matrix

I am writing MATLAB code which does some operations on a large matrix. First I create three 3D arrays:
dw2 = 0.001;
W2 = 0:dw2:1;
dp = 0.001;
P1 = dp:dp:1;
dI = 0.001;
I = 1:-dI:0;
[II,p1,ww2] = ndgrid(I,P1,W2);
Then my code basically does the following:
G = 0:0.1:10;
Y = zeros(length(G),1);
for i = 1:length(G)
    g = G(i);
    Y(i) = myfunction(II,p1,ww2,g);
end
This code roughly takes about 100s, with each iteration being nearly 10s.
However, after I start a parallel pool and switch the loop to parfor:
ProcessPool with properties:
Connected: true
NumWorkers: 48
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes (30 minutes remaining)
SpmdEnabled: true
the code seems to run forever. The maximum number of workers is 48. I've also tried 2, 5 and 10 workers. All of these are slower than the non-parallel version. Is that because MATLAB copies II, p1 and ww2 to all 48 workers, and that causes the problem? myfunction also involves a lot of vectorization, and I have already optimized it. Could that lead to slow parfor performance? Is there a way to utilize (some of) the 48 workers to speed up the code? Any comments are highly appreciated. I need to run millions of cases, so I really hope I can utilize the 48 workers in some way.
It seems that you have large data, and a lot of cores. It is likely that you simply run out of memory, which is why things get so slow.
I would suggest that you set up your workers to be threads, not separate processes.
You can do this with parpool('threads'). Your code must conform to some limitations, as not all code can be run this way; see here.
In thread-based parallelism, you have shared memory (arrays are not copied). In process-based parallelism, you have 48 copies of MATLAB running on your computer at the same time, each needing their own copy of your data. That latter system was originally designed to work on a compute cluster, and was later retrofitted to work on a single machine with two or four cores. I don’t think it was ever meant for 48 cores.
If you cannot use threads with your code, configure your parallel pool to have fewer workers, for example parpool('local',8).
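Putting those suggestions together, a minimal sketch, assuming myfunction is compatible with thread-based pools (if not, fall back to the smaller process pool):
delete(gcp('nocreate'));   % shut down any existing pool
parpool('threads');        % thread workers share memory with the client
parfor i = 1:length(G)
    Y(i) = myfunction(II, p1, ww2, G(i));   % II, p1, ww2 are not copied
end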
For more information, see this documentation page.

Why is it faster to transfer data from CPU to GPU rather than GPU to CPU?

I've noticed that transferring data to recent high-end GPUs is faster than gathering it back to the CPU. Here are the results from a benchmarking function provided to me by MathWorks tech support, run on an older NVIDIA K20 and a recent NVIDIA P100 with PCIe:
Using a Tesla P100-PCIE-12GB GPU.
Achieved peak send speed of 11.042 GB/s
Achieved peak gather speed of 4.20609 GB/s
Using a Tesla K20m GPU.
Achieved peak send speed of 2.5269 GB/s
Achieved peak gather speed of 2.52399 GB/s
I've attached the benchmark function below for reference. What is the reason for the asymmetry on the P100? Is this system dependent or is it the norm on recent high end GPUs? Can the gather speed be increased?
gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name)
sizeOfDouble = 8; % Each double-precision number needs 8 bytes of storage
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii = 1:numel(sizes)
    numElements = sizes(ii)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = randi([0 9], numElements, 1, 'gpuArray');
    % Time sending to GPU
    sendFcn = @() gpuArray(hostData);
    sendTimes(ii) = gputimeit(sendFcn);
    % Time gathering back from GPU
    gatherFcn = @() gather(gpuData);
    gatherTimes(ii) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Achieved peak send speed of %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Achieved peak gather speed of %g GB/s\n',maxGatherBandwidth)
Edit: we now know it is not system-dependent (see comments). I still want to know the reason for the asymmetry, or whether it can be changed.
This is a CW for anybody interested in posting benchmarks from their machine. Contributors are encouraged to leave their details in case some future question arises regarding their results.
System: Win10, 32 GB DDR4-2400 MHz RAM, i7 6700K. MATLAB: R2018a.
Using a GeForce GTX 660 GPU.
Achieved peak send speed of 7.04747 GB/s
Achieved peak gather speed of 3.11048 GB/s
Warning: The measured time for F may be inaccurate because it is running too fast. Try measuring something that takes
longer.
Contributor: Dev-iL
System: Win7, 32GB RAM, i7 4790K. MATLAB: R2018a.
Using a Quadro P6000 GPU.
Achieved peak send speed of 1.43346 GB/s
Achieved peak gather speed of 1.32355 GB/s
Contributor: Dev-iL
I am not familiar with the MATLAB GPU toolboxes, but I suspect that the second transfer (which gets data back from the GPU) starts before the first has ended:
% Time sending to GPU
sendFcn = @() gpuArray(hostData);
sendTimes(ii) = gputimeit(sendFcn);
%
% No synchronization here
%
% Time gathering back from GPU
gatherFcn = @() gather(gpuData);
gatherTimes(ii) = gputimeit(gatherFcn);
A similar question, for a C program, was posted here:
copy from GPU to CPU is slower than copying CPU to GPU
In that case, there is no explicit sync between launching a kernel on the GPU and getting the result data back from the GPU.
So the function that gets the data back, cudaMemcpy() in C, has to wait for the GPU to finish the previously launched kernel before transferring the data, thus inflating the measured data-transfer time.
With the CUDA C API, it is possible to force the CPU to wait for the GPU to finish the previously launched kernel(s) with:
cudaDeviceSynchronize();
And only then start measuring the time to transfer data back.
Maybe in Matlab there is also some synchronization primitive.
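For the record, MATLAB does have one: wait(gpuDevice) blocks until the GPU has finished all pending work. A sketch of hand-timing a transfer with explicit synchronization (illustrative only; gputimeit is normally preferred precisely because it handles synchronization for you):
d = gpuDevice();
wait(d);                  % make sure no GPU work is still pending
tic
hostCopy = gather(gpuData);
wait(d);                  % ensure the device is idle again before stopping the clock
toc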
In the same answer it is also recommended to measure time with (CUDA) events.
This post on optimizing data transfers, also in C (sorry), uses events to measure data-transfer times:
https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
There, the time for transferring data is the same in both directions.

Why does memory usage increase as a Keras neural network is being trained with fit_generator and validation_data?

I have 32 GB of RAM and am training a large dataset using a Keras sequential neural network on a Windows 7 machine. Because of the size of the dataset, I have opted to use fit_generator, taking in around 5000 samples in each batch, each with about 500 features. I have a gc.collect() in the generator to address a potential memory leak, which helped in previous iterations of this model.
For the first few steps of the first epoch, memory consumption is low. Then after around 15 steps, it starts to increase and decrease until eventually it caps off at 27.6 GB.
Can anyone explain why the memory usage increases over time? Also, it's been hundreds of steps into this first epoch and the memory is still sitting at 27.6 GB. Does this have any significance?
The NN itself is 3 layers deep, with 50 neurons in each. I understand that there are some memory requirements for storing the weights, but would this increase over time?
# pandas (pd), features(), csv, train_shape, callbacks_list and class_weight
# are defined/imported elsewhere in the script
import gc

def gen_data(max, rows, skip=0):  # skip defaults to 0 so the two-argument calls below work
    while True:
        data = pd.read_csv(csv, skiprows=range(1, skip), nrows=rows, index_col=0)
        x, y = features(data)
        yield x, y
        skip += rows
        if max is not None and skip >= max:
            skip = 0
        gc.collect()

model = Sequential()
model.add(Dense(50, input_dim=train_shape, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(50, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(50, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

hist = model.fit_generator(gen_data(8000000, 5000), epochs=50,
                           steps_per_epoch=int(8000000/5000), verbose=1,
                           callbacks=callbacks_list, class_weight=class_weight,
                           validation_steps=10,
                           validation_data=gen_data(800000, 80000))
-- edit --
When removing validation_steps and validation_data, the process does not blow up in memory. This seems like odd behavior, because I would not expect the validation data to be used until the end of the epoch. Any ideas?

Matlab R2017a memory profiler gives a ridiculous number for allocated memory

My code is:
function eigs_mem_test
N = 20000;
density = 0.2;
numOfModes = 250;
A = sprand(N, N, density);
profile -memory on
eigs(A, numOfModes, 0.0)
profile off
profsave(profile('info'), 'eigs_test')
profview
end
This returns a profile report saying that MATLAB allocated 18014398508117708.00 Kb, i.e. about 1.8e10 GB, which is completely impossible. How did this happen? The code finishes with the correct output, and in htop I can see memory usage vary quite a bit, but it stays under 16 GB.
For N = 2000, I get sensible results (about 0.2 GB allocated).
How can I profile this case effectively if I want an upper bound on the memory used for large sparse matrices?
I use MATLAB R2017a.
I cannot reproduce your issue in R2017b, on a machine with 128 GB of RAM. After running your example code, the profiler reported that the function peaked at 14726148 Kb, which read literally (as kilobits) would be ~1.8 GB. I'm more confused by the units MATLAB has used here: I saw nearer 14 GB of usage in the task manager, which matches your large observed usage (1.4e7 KB is 14 GB), so I can only think the profiler is meant to state KB (kilobytes) where it says Kb (kilobits).
Ridiculously large, unexpected values like this are often the result of overflow, so this could be an internal overflow bug.
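For what it's worth, the reported figure differs from 2^54 by less than one part in 10^10, which fits the overflow suspicion. A quick sanity check, nothing more:
reported = uint64(18014398508117708);
fprintf('2^54     = %d\n', uint64(2)^54);             % 18014398509481984
fprintf('reported = %d\n', reported);
fprintf('ratio    = %.12f\n', double(reported)/2^54); % ~0.999999999924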
You could use whos to get the size in memory of a variable:
w = whos('A'); % get details of variable A
sizeInMemory = w.bytes; % size of A in memory, in bytes
This doesn't necessarily tell you how much memory a function like eigs in your example uses, though. You could poll memory within your function to get the current usage.
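A minimal sketch of such polling (note that memory is Windows-only, and a before/after difference misses the transient peak inside eigs):
m0 = memory;                                  % snapshot before
eigs(A, numOfModes, 0.0);
m1 = memory;                                  % snapshot after
deltaGB = (m1.MemUsedMATLAB - m0.MemUsedMATLAB) / 2^30;
fprintf('Retained memory increase: %.2f GB\n', deltaGB);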
I'll resist exploring this further, since the question of how to profile for memory usage has already been asked and answered.
N.B. I'm not sure why my machine was ~100x slower than yours; I assume the image of your memory usage didn't come from actually running your example code? Or my RAM is awful...

Neural network gets only 50% good prediction on test data

I made a neural network with which I want to classify input data (400 features per input vector) as one of five Arabic dialects. I divide the training data into "train data", "validation data" and "test data" with net.divideFcn = 'dividerand';. I use trainbr as the training function, which results in long training, because I have 9000 elements in the training data.
For the network architecture I used two layers: the first with 10 perceptrons, the second with 5 (5 because I use a one-vs-all strategy).
The network training usually ends with the minimum gradient reached, rather than the minimum error.
How can I make the network predict better? Could it be a problem with generalization (the network learns the training data very well, but tends to fail on new test data)?
Should I add more perceptrons to the first layer? I ask because it takes about an hour to train the network with 10 perceptrons in the first layer, so the time will only increase.
This is the code for my network:
[Test] = load('testData.mat');
[Ex] = load('trainData.mat');
Ex.trainVectors = Ex.trainVectors';
Ex.trainLabels = Ex.trainLabels';
net = newff(minmax(Ex.trainVectors),[10 5] ,{'logsig','logsig'},'trainlm','learngdm','sse');
net.performFcn = 'mse';
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.95;
net.trainParam.epochs = 1000;
net.trainParam.goal = 0;
net.trainParam.max_fail = 50;
net.trainFcn = 'trainbr';
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
net = init(net);
net = train(net,Ex.trainVectors,Ex.trainLabels);
Thanks !
Working with neural networks is a kind of creative work, so no one can give you the one true answer. But I can give some advice based on my own experience.
First of all, check the network error when training ends, on the training and validation data sets (before you start to use the test data set). You said it reaches a minimum, but what is its actual value? If it is around 50% too, then we have bad data or the wrong net architecture.
If the error on the training data set is OK, the next step is to check how much the coefficients of your net change at the validation step, and what happens to the error there. If they change dramatically, that's a sign our architecture is wrong: the network cannot generalize and would have to be retrained for every new data set.
What else can we do before changing the architecture? We can change the number of epochs. Sometimes this gives good results, but it is somewhat random; we must be sure the change in the coefficients is small during the final steps of training. As I remember, nntool checks this automatically, so maybe we can skip this step.
One more thing I want to recommend: vary the training data set. As you may know, rand gives you the same sequence at every MATLAB start, so if you create your data sets only once you will always work with the same sets. This also matters for non-homogeneous data: some parts of your data may be more important than others. So if several different random splits give about the same error, the data is OK and we can go further. If not, we need to work with the data and split it more carefully. Sometimes I avoid dividerand and divide the data manually, as sketched below.
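A minimal sketch of a manual split with divideind, assuming the 9000 samples from the question (the index ranges are illustrative):
rng('shuffle');                               % a different split on every run
idx = randperm(9000);
net.divideFcn = 'divideind';
net.divideParam.trainInd = idx(1:6300);       % 70%
net.divideParam.valInd   = idx(6301:7650);    % 15%
net.divideParam.testInd  = idx(7651:9000);    % 15%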
Sometimes I try changing the type of activation function. Here you use perceptrons... so the idea is to try sigmoid or linear neurons instead. This rarely leads to significant improvements, but it can help.
If all these steps don't give you enough, you have to change the net architecture, and the number of neurons in the first layer is the first thing to change. When working on a neural network I usually spend a lot of time trying not only different numbers of neurons but different types of nets too.
For example, I found an interesting article about your topic, by Alberto Simões. This is what they say:
Regarding the number of units in the hidden layers, there are some rules of thumb: use the same number of units in all hidden layers, and use at least the same number of units as the maximum between the number of classes and the number of features. But there can be up to three times that value. Given the high number of features we opted to keep that same number of units in the hidden layer.
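Applied to the numbers in your question (400 features, 5 classes), that rule of thumb suggests a far wider hidden layer than your current 10 perceptrons:
% Rule of thumb from the quoted article, applied to this question's numbers
numFeatures = 400;
numClasses  = 5;
lowerBound  = max(numClasses, numFeatures);   % at least 400 hidden units
upperBound  = 3 * lowerBound;                 % up to 1200 hidden units
fprintf('Suggested hidden units: %d to %d\n', lowerBound, upperBound);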
Some advice from the comments:
The data split method (for train and test data sets) depends on your data. For example, I worked on industrial data and found that in the last part of the data set a technological parameter (the pressure of some equipment) changed, so I had to include data from both operating modes in the training set. But in your case I don't think there is the same problem... I recommend you try several random splits (just check they are really different!).
For measuring net error I usually calculate the full vector of errors: I train the net and then check its output for all values to get the whole error vector. It is useful to produce views such as histograms, so I can see where my net goes wrong. It is not necessary, and even harmful, to get the SSE (or MSE) close to zero; usually that means you have already overtrained the net. As a first approximation I usually try to get 80-95% of values correct on the training data set, and then try the net on the test data set. A sketch of this check follows.
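A minimal sketch of that check, reusing the variable names from your code (histogram needs R2014b or later; ploterrhist is an alternative):
outputs  = net(Ex.trainVectors);                  % run the trained net
errors   = gsubtract(Ex.trainLabels, outputs);    % element-wise targets minus outputs
histogram(errors(:));                             % see where the net goes wrong
accuracy = mean(vec2ind(outputs) == vec2ind(Ex.trainLabels));
fprintf('Correct on training set: %.1f%%\n', 100*accuracy);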