I am new to PyTorch.
I am training an ANN for classification on the MNIST dataset.
train_loader = DataLoader(train_data, batch_size=6000, shuffle=True)
I am confused: the dataset has 60,000 images, I have set a batch size of 6,000, and my model trains for 30 epochs.
Will every epoch see only 6,000 images, or will every epoch see 10 batches of 6,000 images?
Every call to the DataLoader iterator returns a batch of images of size batch_size. With a batch size of 6,000 you therefore get 10 batches per epoch, at which point all 60,000 images have been used, so every epoch sees the whole dataset.
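To make that concrete, here is a minimal sketch; it swaps the real MNIST train_data for a random stand-in dataset of the same size, but the batching behaviour is identical:

import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for the real MNIST train_data: 60,000 samples of 28*28 = 784 pixels
train_data = TensorDataset(torch.randn(60000, 784), torch.randint(0, 10, (60000,)))
train_loader = DataLoader(train_data, batch_size=6000, shuffle=True)

print(len(train_loader))  # 10 batches per epoch

for epoch in range(30):
    for images, labels in train_loader:  # 10 iterations, 6,000 images each
        pass  # forward/backward pass would go here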
I am using xcorr as below:
simM=xcorr(data,10,'normalized');
Here data is a 1200-by-1200 double, and the output simM is a 21-by-1,440,000 double. Now I want to run this for larger inputs, but my system has only 64 GB of RAM. For data bigger than 1500-by-1500 the system gives an out-of-memory error, which is understandable. According to some answers I have read in the MATLAB community, splitting the matrix is an option, but can you please describe how to do that?
For a 1100-by-1100 input my system takes 147 seconds; if possible, can you also suggest a way to improve the speed?
Loop over the columns of d (your data matrix) and compute the cross-correlation one pair of columns at a time:
d = data;                          % the 1200-by-1200 input matrix
simulation = zeros(21, 1200*1200); % preallocate: 21 lags x 1200^2 column pairs
for i = 1:1200
    for j = 1:1200
        % each call returns a 21-by-1 vector (lags -10..10), stored as one column
        simulation(:, (i-1)*1200 + j) = xcorr(d(:,i), d(:,j), 10, 'normalized');
    end
end
For speed, convert your double to single (single(d)); it reduces RAM usage and calculation time. You could also use parfor or GPU computing to further speed up the for loop:
d = single(d);   % halves memory use and speeds up xcorr
d = gpuArray(d); % optional: move the data to the GPU (needs Parallel Computing Toolbox)
...
I have 32 GB of RAM and am training a large dataset using a Keras sequential neural network on a Windows 7 machine. Because of the size of the dataset, I have opted to use fit_generator, taking in around 5,000 samples per batch, each with about 500 features. I have a gc.collect() in the generator to address a potential memory leak, which helped in previous iterations of this model.
For the first few steps of the first epoch, memory consumption is low. Then after around 15 steps, it starts to increase and decrease until eventually it caps off at 27.6 GB.
Can anyone explain why the memory usage increases over time? Also, it has been hundreds of steps into this first epoch and the memory is still sitting at 27.6 GB. Does this have any significance?
The NN itself is 3 layers deep, with 50 neurons in each. I understand that there are some memory requirements for storing the weights, but would this increase over time?
import gc
import pandas as pd

def gen_data(max, rows, skip=0):
    # csv (the file path) and features() are defined elsewhere in my script
    while True:
        data = pd.read_csv(csv, skiprows=range(1, skip), nrows=rows, index_col=0)
        x, y = features(data)
        yield x, y
        skip += rows
        if max is not None and skip >= max:
            skip = 0
        gc.collect()
model=Sequential()
model.add(Dense(50, input_dim = train_shape, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(50, input_dim = train_shape, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(50, input_dim = train_shape, activation='linear'))
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
hist = model.fit_generator(gen_data(8000000,5000),epochs=50,
steps_per_epoch=int(8000000/5000),verbose=1,callbacks=callbacks_list,
class_weight=class_weight,validation_steps=10,validation_data=gen_data(800000,80000))
-- edit --
When removing validation_steps and validation_data, the process does not blow up in memory. This seems like odd behavior because I would not expect the validation data to be used until the end of the epoch. Any ideas?
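One thing that might be worth testing (a guess on my part, not something confirmed here): the validation generator reads 80,000 rows per chunk versus 5,000 for training, so each validation batch is roughly 16 times larger. Below is a sketch of the same call, reusing the model, gen_data, callbacks_list and class_weight from above, with a hypothetical matching validation chunk size:

val_rows = 5000  # assumed smaller chunk size, matching the training batches
hist = model.fit_generator(
    gen_data(8000000, 5000), epochs=50,
    steps_per_epoch=int(8000000 / 5000), verbose=1,
    callbacks=callbacks_list, class_weight=class_weight,
    validation_data=gen_data(800000, val_rows),
    validation_steps=int(800000 / val_rows))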
I understand (from here) that a bigger batch size gives more accurate results. But I am not sure which batch size is "good enough". I guess bigger batch sizes will always be better, but it seems that at a certain point you only get a slight improvement in accuracy for every increase in batch size. Is there a heuristic or a rule of thumb for finding the optimal batch size?
Currently I have 40,000 training samples and 10,000 test samples. My batch size is the default, which is 256 for training and 50 for testing. I am using an NVIDIA GTX 1080, which has 8 GB of memory.
Test-time batch size does not affect accuracy; you should set it to the largest value you can fit into memory so that the validation step takes less time.
As for train-time batch size, you are right that larger batches yield more stable training. However, having larger batches will slow training significantly. Moreover, you will have fewer backprop updates per epoch. So you do not want the batch size to be too large. Using the default values is usually a good strategy.
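To put a number on the "fewer backprop updates per epoch" point, a quick calculation with the 40,000 training samples from the question (the batch sizes here are just examples):

import math

n_train = 40000  # training samples from the question

for batch_size in (64, 256, 1024, 4096):
    updates_per_epoch = math.ceil(n_train / batch_size)
    print(batch_size, updates_per_epoch)
# 64 -> 625, 256 -> 157, 1024 -> 40, 4096 -> 10 gradient updates per epoch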
See my master's thesis, page 59, for some of the reasons to choose a bigger or smaller batch size. You want to look at:
epochs until convergence
time per epoch: lower is better
resulting model quality: higher is better (in my experiments)
A batch size of 32 was good for my datasets / models / training algorithm.
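If you want to measure these criteria on your own setup, a rough sketch like the following works; it assumes tf.keras and uses made-up data with 40,000 samples and 100 features (the feature count is invented), timing a few candidate batch sizes:

import time
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# dummy stand-in data: 40,000 samples, 100 features (feature count is made up)
x = np.random.rand(40000, 100).astype("float32")
y = (np.random.rand(40000) > 0.5).astype("float32")

for batch_size in (32, 256, 1024):
    model = Sequential([Dense(50, activation="relu", input_dim=100),
                        Dense(1, activation="sigmoid")])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    t0 = time.time()
    hist = model.fit(x, y, batch_size=batch_size, epochs=3,
                     validation_split=0.1, verbose=0)
    print(batch_size, round(time.time() - t0, 1), "s for 3 epochs,",
          "val_acc:", round(hist.history["val_accuracy"][-1], 3))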
Suppose one has 1,000 training examples and a batch size of 500; then it will take 2 iterations to complete 1 epoch. Now let's say I am using the Caffe framework's on-the-fly data augmentation, i.e. 10 crops per example.
My question is: will one epoch still be 2 iterations, as in the example above, or will it become 2*10 = 20 iterations?
An epoch is the number of iterations it takes to go over the training data once. Since you augment your data, it will take 10 times as many iterations to complete one pass over the training data. Hence 1 epoch is now 2*10 = 20 iterations.
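A quick sanity check of that arithmetic, using the 1,000 examples implied by the question (2 iterations at batch size 500):

import math

n_train = 1000          # implied by: 2 iterations x batch size 500
batch_size = 500
crops_per_example = 10  # on-the-fly augmentation

print(math.ceil(n_train / batch_size))                       # 2 iterations without augmentation
print(math.ceil(n_train * crops_per_example / batch_size))   # 20 iterations with 10 crops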
I am looking at the Python example for LeNet and see that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be computed instead of hard-coded? How can I get the number of samples of the dataset that a network points to, in Python?
You can use the lmdb library to access the LMDB directly:
import lmdb  # requires the lmdb Python package

db = lmdb.open('/path/to/lmdb_folder')
num_examples = int(db.stat()['entries'])  # number of records in the database
Should do the trick for you.
It seems that you have mixed up iterations and the number of samples in one question. In the provided example we can only see the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
Some more detailed explanation:
EDIT: Caffe will load a total of (batch size x iterations) samples for training or testing, but there is no relation between the number of loaded samples and the actual database size: it will start reading from the beginning again after reaching the last database record. In other words, the database in Caffe acts like a circular buffer.
The mentioned example points to this configuration. We can see that it expects LMDB input and sets the batch size to 64 (some more info about batches and BLOBs) for the training phase and 100 for the testing phase. It really makes no assumption about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the processing chunk size, and iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database, only to the dataset name and format and the desired number of samples per batch. There is no way to determine the database size with Caffe at the moment, as far as I know.
Thus, if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify the corresponding values for the batch size and iterations.
You have some options for this:
Look at the console output of ./examples/mnist/create_mnist.sh: it prints the number of samples while converting from the initial format (I believe you followed this tutorial);
follow @Shai's advice (read the LMDB file directly).
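Building on the LMDB snippet above, here is a small sketch of turning the entry count into the number of test iterations; the path is a placeholder and the batch size of 100 is the test-phase value mentioned in this thread, so adjust both to your setup:

import math
import lmdb

db = lmdb.open('/path/to/mnist_test_lmdb', readonly=True)
num_examples = int(db.stat()['entries'])   # e.g. 10,000 for the MNIST test set
test_batch_size = 100                      # batch size of the TEST-phase data layer
test_iter = math.ceil(num_examples / test_batch_size)
print(num_examples, test_iter)             # use test_iter in the solver prototxt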