keras fit_generator parameter steps_per_epoch - neural-network

I want to use the keras model.fit_generator method. For that I wrote my own generator, and the method requires me to define the parameter "steps_per_epoch". I want to use every training sample exactly once per epoch.
My problem is that I generate the features inside the generator: I read wav files and compute the FFT, so before training starts I don't know how many batches/samples I have. I could compute the FFT for every file before calling fit_generator, but every time my dataset (>20 GB) changes I would have to recompute the FFT for every file and save the resulting count for steps_per_epoch. Is there a better way to make fit_generator use every sample exactly once without computing steps_per_epoch in advance? Or can my own generator tell fit_generator when to start a new epoch?
Here is the code for my generator:

def my_generator(filename_list, batch_size):
    while True:                                   # loop forever; Keras decides when an epoch ends
        for fname in filename_list:
            data, sr = librosa.load(fname)        # read the wav file
            fft_result = librosa.core.stft(data)  # compute the FFT
            batches = features.create_batches(fft_result, batch_size)
            for i in range(len(batches)):
                yield (batches[i], label)

model.fit_generator(my_generator(filename_list=filename_list, batch_size=batch_size),
                    steps_per_epoch=100, epochs=10)

For each file in the list, you have to compute an FFT that yields 'n' batches, where 'n' is different for each file. If this is the case, then:
The naive method is to loop through the batch generator once to count the actual number of batches. This process needs to be done only once, and you can save that number for future use as well.
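A minimal sketch of that counting pass, reusing the librosa calls and the (hypothetical) features.create_batches helper from the question:

import librosa

def count_steps(filename_list, batch_size):
    # One pass over all files: sum the number of batches each file produces.
    steps = 0
    for fname in filename_list:
        data, sr = librosa.load(fname)
        fft_result = librosa.core.stft(data)
        steps += len(features.create_batches(fft_result, batch_size))
    return steps

steps_per_epoch = count_steps(filename_list, batch_size)
model.fit_generator(my_generator(filename_list, batch_size),
                    steps_per_epoch=steps_per_epoch, epochs=10)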
The second method is to assign an arbitrary number to steps_per_epoch. That number should be greater than or equal to the number of files in the list multiplied by the number of batches each FFT can generate (an estimate of that per-file batch count is good enough). This way, if you shuffle the file list after the outer "for" loop completes, then after some number of epochs, statistically speaking, all of the training data will have been seen by the model. Combined with early stopping you can still obtain a properly converged model; "epochs" should then be a very large value, 1000 for example.
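A hedged sketch of that second approach with Keras's EarlyStopping callback; val_generator and val_steps are assumed placeholders for whatever validation source you have:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5)
model.fit_generator(my_generator(filename_list, batch_size),
                    steps_per_epoch=1000,          # arbitrary upper bound per "epoch"
                    epochs=1000,                   # very large; early stopping ends training
                    validation_data=val_generator, # hypothetical validation generator
                    validation_steps=val_steps,    # hypothetical validation step count
                    callbacks=[early_stop])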


Parameter Variation: Get dataset for one specific iteration

I am running a parameter variation experiment in AnyLogic and collecting histogram data about the number of specific agents. The histogram returns the min, mean, and max values (among others).
I am looking for a way to get the dataset for one specific iteration only (the iteration that is closest to the mean value of the histogram data).
Is there a way to return data for one specific iteration?
Many thanks!
Yes, but not with HistogramData objects.
Use normal Dataset objects in your experiment. In the properties, you can switch to "use x value as iteration" and store any value from your model iterations in the y-value.
Now you have a nice table with data from your individual iterations.
cheers

Set Batch Size *and* Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the following parameters:
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand it, it makes little sense to set both batch size and the number of iterations, because one is determined by the other (given the data size, which is fixed by the circumstances). So why can I change both parameters?
This is not necessarily the case. You can also train "half epochs". For example, in Google's inceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
Whether it is a good idea to train half epochs may depend on your data. There is a thread about this, but no concluding answer.
I am not familiar with KNIME Doc2Vec, so I am not sure whether the meaning is somewhat different there. But from the definitions you gave, setting batch size + iterations seems fine. Setting the number of epochs as well could cause conflicts, though, leading to situations where the numbers don't add up to a reasonable combination.
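For intuition, a small arithmetic sketch of how the three quantities interact (all numbers made up): a "consistent" combination covers every example once per epoch, while a freely chosen iteration count gives partial epochs.

import math

num_examples = 10000                                           # hypothetical dataset size
batch_size = 64

# Consistent full-epoch combination: iterations follow from data size and batch size.
iterations_per_epoch = math.ceil(num_examples / batch_size)    # 157

# Freely chosen iterations give "partial epochs" instead.
iterations = 100
epochs_covered = iterations * batch_size / num_examples        # 0.64 of an epoch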

Some questions about the train_test_split() function

I am currently trying to use Python's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=<some int>) and train_test_split(dataset, test_size)? Does the second one (without setting random_state) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting random_state=42 vs random_state=43, for example?
In Python, scikit-learn's train_test_split splits your input data into two sets: i) train and ii) test. It has an argument random_state that controls how the data is split randomly.
If the argument is not set, a new random seed is used on every call, so you will get a different split of the same dataset each time you run the program.
If you want to measure the performance of your regression on different splits of the same data, you can pass different random_state values: each value gives one particular pseudo-random split of the initial data (so random_state=42 and random_state=43 simply produce two different, individually reproducible, splits). To keep track of a result and reproduce it later on the same data, use the same random_state value you used before.
This is also useful for cross-validation techniques in machine learning.
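A minimal sketch of the difference, using scikit-learn's train_test_split on stand-in arrays (X and y here are hypothetical placeholders for the 1000-row dataset):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 7)   # placeholder features
y = np.random.rand(1000)      # placeholder target

# Reproducible: the same split on every run (and for anyone else using 42).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Not reproducible: a different pseudo-random split on every run.
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)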

How to get the dataset size of a Caffe net in python?

I am looking at the Python example for LeNet and see that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be obtained without hard-coding it? How do I get, in Python, the number of samples of the dataset a network points to?
You can use the lmdb library to access the lmdb directly:

import lmdb

db = lmdb.open('/path/to/lmdb_folder')        # requires the lmdb package
num_examples = int(db.stat()['entries'])      # number of records in the database
Should do the trick for you.
It seems that you mixed up iterations and the number of samples in one question. In the provided example we can see only the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
Some more detailed explanation:
EDIT: Caffe will load a total of (batch size x iterations) samples for training or testing, but there is no relation between the number of loaded samples and the actual database size: it will start reading from the beginning again after reaching the last record - in other words, the database in Caffe acts like a circular buffer.
The mentioned example points to this configuration. We can see that it expects lmdb input and sets the batch size to 64 (some more info about batches and BLOBs) for the training phase and 100 for the testing phase. We really don't make any assumption about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the processing chunk size, and iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database - only to the dataset name and format and the desired number of samples per batch. There is no way to determine the database size through Caffe at the moment, as far as I know.
Thus, if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify corresponding values for batch size and iterations.
You have some options for this:
Look at the ./examples/mnist/create_mnist.sh console output - it prints the number of samples while converting from the initial format (I believe that you followed this tutorial);
follow @Shai's advice above (read the lmdb file directly).
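Once the number of samples is known, deriving the iteration count is simple arithmetic. A small sketch, using the MNIST test set size and the test-phase batch size of 100 mentioned above (adjust both to your own data):

import math

num_examples = 10000        # e.g. the value read from db.stat()['entries'] above
test_batch_size = 100       # the batch_size set in the test-phase Data layer
test_iterations = math.ceil(num_examples / test_batch_size)
print(test_iterations)      # 100 iterations cover the whole test set exactly once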

Equivalent moment using Rainflow

I am trying to perform a fatigue analysis from a load series, and I would like to extract the equivalent moment for
number cycles=1000000
years=25
I have a time series for one hour (plot not included here).
I have read that rainflow analysis is a very good tool for extracting the cycles from a time history. Therefore I apply:
%Rainflow moment
dt=time(2)-time(1);
[timeSeriesSig, extt] = sig2ext(timeSeries, dt);
rf = rainflow(timeSeriesSig,extt);
I read that
OUTPUT
rf - rainflow cycles: matrix 3xn or 5xn dependent on input,
rf(1,:) Cycles amplitude,
rf(2,:) Cycles mean value,
rf(3,:) Number of cycles (0.5 or 1.0),
rf(4,:) Beginning time (when input includes dt or extt data),
rf(5,:) Cycle period (when input includes dt or extt data),
If I am interested in the number of cycles, what does the term rf(3,:) mean? The vector only contains values of 0.5 and 1. I want to obtain a histogram with the number of cycles per amplitude bin. Thanks
If you are using code from the File Exchange or another source, it helps to link to where you obtained it.
Here, rf(3,:) is showing you whether the amplitudes and other outputs refer to half or full cycles. The rainflow algorithm finds half-cycles first, and then matches tensile and compressive half-cycles of equal amplitude to find full cycles. There are normally some residual half-cycles.
If you examine the rfhist function contained within that same File Exchange submission, you will see how they handle full and half cycles. Basically, two half-cycles at any given load amplitude will be counted as one full cycle when creating the histogram.
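A small illustration of that weighting idea, written in Python/NumPy rather than MATLAB and using a made-up rf matrix: the per-cycle counts (0.5 or 1.0) are used as histogram weights, so two half-cycles at a given amplitude contribute one full cycle to that bin.

import numpy as np

# Hypothetical rainflow output, analogous to the MATLAB rf matrix:
# row 0 = cycle amplitudes (rf(1,:)), row 2 = cycle counts of 0.5 or 1.0 (rf(3,:)).
rf = np.array([[1.2, 0.8, 1.2, 2.0, 0.8],
               [0.0, 0.1, 0.0, 0.2, 0.1],
               [0.5, 1.0, 0.5, 1.0, 0.5]])

cycles_per_bin, bin_edges = np.histogram(rf[0], bins=5, weights=rf[2])
print(cycles_per_bin)   # full-cycle-equivalent count per amplitude bin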