Epoch size (in terms of iterations) with data augmentation caffe - neural-network

Suppose one has 1000 training examples and a batch size of 500; then it will take 2 iterations to complete 1 epoch. Now let's say I am using the Caffe framework's on-the-fly data augmentation, i.e. 10 crops per example.
My question is: will the epoch still be 2 iterations, as in the example above, or will it become 2*10 = 20?

An epoch is the number of iterations it takes to go over the training data once. Since you augment your data with 10 crops per example, it will take you 10 times more iterations to complete one pass over the training data. Hence 1 epoch = 2*10 = 20 iterations now.
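A minimal sketch of the arithmetic in Python (the variable names are illustrative; the example count of 1000 comes from the question above):

num_examples = 1000       # training examples
batch_size = 500
crops_per_example = 10    # augmentation factor

iters_per_epoch = num_examples / batch_size                                # 2 iterations
iters_per_epoch_augmented = num_examples * crops_per_example / batch_size  # 20 iterations
print(iters_per_epoch, iters_per_epoch_augmented)  # 2.0 20.0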

Related

Calculating Mbps in Prometheus from cumulative total

I have a metric in Prometheus called unifi_devices_wireless_received_bytes_total, it represents the cumulative total amount of bytes a wireless device has received. I'd like to convert this to the download speed in Mbps (or even MBps to start).
I've tried:
rate(unifi_devices_wireless_received_bytes_total[5m])
Which I think is saying: "please give me the rate of bytes received per second", over the last 5 minutes, based on the documentation of rate, here.
But I don't understand what "over the last 5 minutes" means in this context.
In short, how can I determine the Mbps based on this cumulative amount of bytes metric? This is ultimately to display in a Grafana graph.
You want rate(unifi_devices_wireless_received_bytes_total[5m]) / 1000 / 1000 for MBps; multiply by 8 for Mbps.
But I don't understand what "over the last 5 minutes" means in this context.
It's the average over the last 5 minutes.
The rate() function returns the average per-second increase rate for the counter passed to it. The average rate is calculated over the lookbehind window passed in square brackets to rate().
For example, rate(unifi_devices_wireless_received_bytes_total[5m]) calculates the average per-second increase rate over the last 5 minutes. It returns a lower-than-expected rate when 100MB of data is transferred in 10 seconds, because it divides those 100MB by 5 minutes and returns the average data transfer speed of 100MB/5minutes ≈ 333KB/s instead of 10MB/s.
Unfortunately, using 10s as a lookbehind window doesn't work as expected - it is likely that rate(unifi_devices_wireless_received_bytes_total[10s]) would return nothing. This is because rate() in Prometheus expects at least two raw samples in the lookbehind window, which means new samples must be written into Prometheus at least every 5 seconds for a [10s] lookbehind window. The solution is to use the irate() function instead of rate():
irate(unifi_devices_wireless_received_bytes_total[5m])
It is likely this query would return a data transfer rate closer to the expected 10MB/s if the interval between raw samples (aka scrape_interval) is lower than 10 seconds.
Unfortunately, it isn't recommended to use the irate() function in the general case, since it tends to return jumpy results when refreshing graphs over big time ranges. Read this article for details.
So the ultimate solution is to use rollup_rate function from VictoriaMetrics - the project I work on. It reliably detects spikes in counter rates by returning the minimum, maximum and average per-second increase rate across all the raw samples on the selected time range.
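For intuition, here is a small sketch in plain Python (not PromQL) of how an average per-second rate over a lookbehind window is derived from two raw counter samples and then converted to Mbps; the sample values are made up:

# Two raw samples of a cumulative byte counter: (unix_timestamp, counter_value).
first = (1_700_000_000, 2_000_000_000)   # value at the start of the 5m window
last = (1_700_000_300, 2_104_857_600)    # value 300 seconds later

bytes_per_second = (last[1] - first[1]) / (last[0] - first[0])  # average increase over the window
mbps = bytes_per_second * 8 / 1_000_000                         # bytes/s -> megabits/s
print(round(bytes_per_second), round(mbps, 2))  # ~349525 bytes/s, ~2.8 Mbps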

Set Batch Size *and* Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the parameters
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand it makes little sense to set batch size and iterations, because one is determined by the other (given the data size, which is given by the circumstances). So why can I change both parameters?
This is not necessarily the case. You can also train "half epochs". For example, in Google's inceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
Whether or not it is a good idea to train partial epochs may depend on your data. There is a thread about this, but no conclusive answer.
I am not familiar with KNIME Doc2Vec, so I am not sure whether the meaning is somewhat different there. But from the definitions you gave, setting batch size + iterations seems fine. Also setting the number of epochs could cause conflicts, though, leading to situations where the numbers don't add up to reasonable combinations.
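A small sketch of the relationship in question (illustrative only, not KNIME's internals): given the dataset size, fixing any two of batch size, iterations, and epochs determines the third, and fixing all three can leave you with partial epochs.

import math

num_examples = 10_000    # assumed dataset size, for illustration
batch_size = 64
num_iterations = 2_000   # total weight updates requested

iters_per_epoch = math.ceil(num_examples / batch_size)   # 157 iterations cover the data once
epochs_covered = num_iterations / iters_per_epoch        # ~12.74, i.e. the last epoch is partial
print(iters_per_epoch, round(epochs_covered, 2))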

keras fit_generator parameter steps_per_epoch

I want to use the Keras model.fit_generator method. For that I wrote my own generator, and for the method I need to define the parameter "steps_per_epoch". I want to use every training sample exactly once per epoch.
Now my problem is that I generate the features inside the generator: I read wav files and compute the FFT, so before I start training I don't know how many batches/samples I have. I could calculate the FFT for every file before calling fit_generator, but every time I change my dataset (>20GB) I would need to recalculate the FFT for every file and save the count for steps_per_epoch. Is there a better way to make fit_generator use every sample only once without calculating steps_per_epoch beforehand? Or can my own generator tell fit_generator when to start a new epoch?
Here is the code for my generator
import librosa

def my_generator(filename_list, batch_size):
    # features.create_batches, label and model are defined elsewhere in my code
    while True:  # loop forever; Keras stops after steps_per_epoch batches per epoch
        for fname in filename_list:
            data, sr = librosa.load(fname)
            fft_result = librosa.core.stft(data)
            batches = features.create_batches(fft_result, batch_size)
            for batch in batches:
                yield (batch, label)

model.fit_generator(my_generator(filename_list=filename_list, batch_size=batch_size),
                    steps_per_epoch=100, epochs=10)
For each file in the list, you have to calculate an FFT that yields 'n' batches, where 'n' is different for each file. If this is the case, then:
The naive method is to loop through the batch generator once to count the actual number of batches. This only needs to be done once, and you can save that number for future use as well.
The second method is to assign an arbitrary number to steps_per_epoch that is greater than or equal to the number of files in the list multiplied by an estimate of the number of batches each FFT can generate. This way, if you shuffle the file list after the outer "for" loop completes, then after some number of epochs, statistically speaking, all of the training data will have been seen by the model. By using early stopping you can still end up with a properly converged model, with "epochs" set to a very large value, e.g. 1000.
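A quick sketch of the naive counting pass (librosa is real, but features.create_batches, filename_list, batch_size, my_generator and model are taken from the question's code and assumed to exist):

import librosa

def count_steps_per_epoch(filename_list, batch_size):
    # Walk the files once, counting how many batches the generator will yield per epoch.
    total = 0
    for fname in filename_list:
        data, sr = librosa.load(fname)
        fft_result = librosa.core.stft(data)
        total += len(features.create_batches(fft_result, batch_size))
    return total

steps = count_steps_per_epoch(filename_list, batch_size)  # save this number for later runs
model.fit_generator(my_generator(filename_list, batch_size),
                    steps_per_epoch=steps, epochs=10)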

Re-Use Sliding Window data for Neural Network for Time Series?

I've read a few ideas on the correct sample size for feed-forward neural networks: 5x, 10x, and 30x the number of weights. This part I'm not overly concerned about; what I am concerned about is whether I can reuse my training data (randomly).
My data is broken up like so
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6x5 = 30 input neurons) and predicting the 7th sample's dependent variable (1 output neuron).
I would train the neural network by running, say, 6 or 7 iterations before trying to predict the next value outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variable (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:8:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my sets per training iteration, say 30 times the number of my weights. I believe my network has about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights. So 44 x 30 = 1200 runs?
So I would do a randomization of the 6 sets 1200 times per training sliding window.
Due to the small amount of data, I figured I would do simulation runs (i.e. rerun the same problem with new weights), say 1000 times, in each of which I do 1140 runs over the sliding window using randomization.
I have 113 data points, which results in 101 training "sliding windows".
Another question: since I'm trying to predict up or down movement (i.e. the dependent variable), should I match an actual number or just whether I guessed the up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a percentage check on whether this number is guessed correctly as up/down.
If you have a small amount of data and a comparatively large number of training iterations, you run the risk of "overtraining" - creating a function which works very well on your training data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do. One is to split the data into training and verification sets - using, say, 85% to train and 15% to verify. Verification means computing the fitness of the learner on the held-out set, without adjusting the weights. When the verification fitness (which you are not training on) stops improving (in general it will be noisy) while your training fitness continues improving - stop training. If, on the other hand, you use a "sliding window", you may not have a good criterion for knowing when to stop training - the fitness function will bounce around in unpredictable ways. (You might slowly reduce the effect of each training iteration on the parameters to force convergence... maybe not the best approach, but some training regimes do this.) The other thing you can do is normalize your node weights via some metric to enforce some notion of 'smoothness' - if you visualize overfitting for a second, you'll find that in the extreme case your fitness function curves sharply around your dataset positives...
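A minimal sketch of that hold-out early-stopping loop (train_one_epoch and evaluate are placeholders for whatever training and fitness code you use):

import random

def train_with_early_stopping(data, train_one_epoch, evaluate, patience=10, max_epochs=1000):
    # Split into 85% training / 15% verification, as suggested above.
    random.shuffle(data)
    split = int(0.85 * len(data))
    train_set, verify_set = data[:split], data[split:]

    best_error, epochs_without_improvement = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch(train_set)        # adjusts weights on training data only
        error = evaluate(verify_set)      # e.g. mean squared error on the held-out data
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:   # verification error stopped improving
            break
    return best_error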
As for the latter question - for the training to converge, your fitness function needs to be smooth. If you were to just use binary all-or-nothing fitness terms, most likely whatever algorithm you are using to train (backprop, BFGS, etc...) would not converge. In practice, the classification criterion should be an activation that is above a threshold for a positive result, at or below it for a negative result, and that varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus realize a fitness function that has a higher "cost" for incorrect guesses that were more certain. There are subtleties possible in how the function is shaped (for example, you might have different ideas about how acceptable a false negative and a false positive are) - and you may also introduce regions of "uncertain" where the result is closer to "zero weight" - but it should certainly be continuous/smooth.
You can re-use sliding windows.
It's basically the same concept as bootstrapping (your training set), which in itself reduces training time, but I don't know if it's really helpful in making the net more adaptive to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic)
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is run using randomization; it resets the randomization at line 353 so the rest flows as normal.
I was also able to use a 1 (up) or 0 (down) as my target values and the network did converge.
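For reference, here is a small sketch (in Python, rather than the C# linked above) of how the sliding-window training sets in the question can be built from a raw series; the 6-sample window and single-target layout follow the example above:

def build_windows(samples, window_len=6):
    # samples: list of (features, target) pairs in time order, where features has 5 values.
    # Each training set uses window_len consecutive samples to predict the next sample's
    # dependent variable, exactly like the Set 1..Set 6 layout above.
    sets = []
    for start in range(len(samples) - window_len):
        window = samples[start:start + window_len]
        inputs = [value for features, _ in window for value in features]  # 6 x 5 = 30 inputs
        target = samples[start + window_len][1]                           # next sample's dependent var
        sets.append((inputs, target))
    return sets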

Clustering a sequence with time stamps (a time series data of two events)

I have been exploring different options for clustering time series data of the following type:
two different events - say 1,2
events time(nanos)
1 1e3
1 6e3
1 8e3
2 12e3
1 54e3
1 58e3
1 62e3
1 67e3
1 70e3
1 75e3
2 103e3
2 108e3
2 114e3
etc etc
i.e., the times are stochastic (exponentially distributed) and either event 1 or event 2 is recorded. The recordings are in nanoseconds. The data set is large, going up to 15-20 minutes, with millions of points.
The events are correlated, and thus a bunch of 2s or 1s can happen together. For example, there will be small pieces (1-millisecond-long pieces having 100-200 events of both types). In some cases, there will be a series of just one event type, which needs to be discarded.
And most of the time, just a single event or a few events are recorded, and this is just noise (>80% of the data).
This is clearly a time series data, with event type information.
I would like to apply a clustering methodology to identify the meaningful small pieces. I'm using Matlab and have tried to look into options such as DBSCAN, k-means (not useful since I don't know the number of clusters a priori), etc.
(The recording times themselves could be taken as a 'distance', since these are sequential chunks, i.e., dist(x1,x2) = abs( x2(2) - x1(2) ) if x is (event, time).
Also, a meaningful sequence of events happening at, say, time = 10.2 to 10.23 seconds has no relationship to any other piece; i.e., the clustering is done only to "identify" the short pieces (expected to be a few 10,000s out of the whole dataset).)
Any help would be appreciated ! Thanks.
What about taking the difference between time points and determining either empirically or statistically a threshold below which the events are "connected"?
% nanotimes: sorted event times in nanoseconds; events: event type (1 or 2) for each time
dtimes = diff(nanotimes);
THRESH = 100;      % gap threshold (ns) - completely made up, will depend on your data
TIMETHRESH = 1e3;  % minimum cluster duration (ns) - also made up, tune for your data

% Assign consecutive events to the same cluster until a gap exceeds THRESH
current_cluster = 1;
assign_clusters = zeros(size(nanotimes));
assign_clusters(1) = current_cluster;
for v = 1:length(dtimes)
    if dtimes(v) > THRESH
        current_cluster = current_cluster + 1;
    end
    assign_clusters(1+v) = current_cluster;
end

% Mark clusters that contain only one event type, or that are too short, as noise (-1)
for v = 1:current_cluster
    indices = find(assign_clusters == v);
    if ~any(events(indices) == 1) || ...
            all(events(indices) == 1) || ...
            (nanotimes(indices(end)) - nanotimes(indices(1)) < TIMETHRESH)
        assign_clusters(indices) = -1;
    end
end
You probably are looking in the wrong domain.
Cluster analysis is meant for multidimensional data, but you have just one true dimension, time.
You really should look at classic statistical methods for series, such as kernel density estimation, natural breaks optimization and such things.
For example, you could estimate the density of events 1 and event 2 using a kernel density estimator, then split the data set whenever the density of event 1 or event 2 becomes higher than the other by a certain threshold. It's actually quite straightforward, once you compute the KDE curves.
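A rough sketch of that idea in Python with SciPy (the question uses Matlab, but the logic is the same; the toy data and the threshold value here are purely illustrative):

import numpy as np
from scipy.stats import gaussian_kde

# times: event timestamps (ns); kinds: event type (1 or 2) for each event - toy data
times = np.array([1e3, 6e3, 8e3, 12e3, 54e3, 58e3, 62e3, 67e3, 70e3, 75e3, 103e3, 108e3, 114e3])
kinds = np.array([1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2])

grid = np.linspace(times.min(), times.max(), 1000)
density_1 = gaussian_kde(times[kinds == 1], bw_method=0.1)(grid)   # KDE curve for event 1
density_2 = gaussian_kde(times[kinds == 2], bw_method=0.1)(grid)   # KDE curve for event 2

# Split the series wherever the dominant event type flips by more than a threshold
threshold = 1e-6   # made up; tune to your data
dominant = np.where(density_1 > density_2 + threshold, 1,
                    np.where(density_2 > density_1 + threshold, 2, 0))
split_points = grid[np.nonzero(np.diff(dominant))[0]]   # candidate segment boundaries
print(split_points)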