Is H2O.ai's mini_batch_size really used? - neural-network

The H2O documentation says:
mini_batch_size: Specify a value for the mini-batch size. (Smaller values lead to a better fit; larger values can speed up and generalize better.)
but when I run a model using the Flow UI (with mini_batch_size > 1), the log file contains:
WARN: _mini_batch_size Only mini-batch size = 1 is supported right now.
So the question is: is mini_batch_size really used?

It appears to be a leftover from preparation for a DeepWater integration that never happened. E.g. https://github.com/h2oai/h2o-3/search?l=Java&p=2&q=mini_batch_size
That makes sense, because the Hogwild! algorithm, which H2O's deep learning uses, does away with the need for batching the training data.
To sum up, I don't think it is used.

Related

Loss of TransE gets plateau at the margin value with BigDL library

I have been trying to use optimizers (SGD, Adagrad) from the BigDL library on TransE with Scala. My current implementation works with mini-batches in a sequential way. I followed this example to optimize the embeddings (as Tensors) without creating a layered model. My code is somewhat similar to this example. My current problem is that, with some parameters, my loss plateaus at the margin value no matter how many epochs I run. Because of this, my hit@10 in testing is not that good. Can someone give any idea why the loss plateaus and whether this causes the bad testing results?
P.S. I have checked my loss calculation and it is fine. The only part of the implementation I have control over is the optimizer.
Thanks in advance.

Barriers to translation stage in Modelica?

Some general Modelica advice?
We've built a model with ~2000 equations and three vectors of input from measured data. Using OpenModelica, attempts at simulation have begun to hang in the translation stage (which runs for hours where it used to take less than a minute) and now I regularly "lose connection to omc.exe." Is there perhaps something cumulative occurring that's degrading translation/compilation performance?
In general, are there any good rules of thumb for keeping simulations lighter and faster? I realize that, depending on the couplings, additional equations could be exponentially increasing the size of the resulting system of equations - could this be a problem?
Thanks for your thoughts!
It shouldn't take that long. Seems like a bug.
You can report this bug here:
https://trac.openmodelica.org/OpenModelica (New Ticket).
If your model is public you can post it there, if not you can contact the OpenModelica team privately.
I did some cleaning in the code and got the part that repeats 12x (the module) down to ~180 equations; in the process I reduced the size of my input vectors (and also a 2D look-up table the module refers to) by quite a bit - they're both down to a few hundred values. It's working now: simulations run in reasonable time, a few minutes each.
Since all these tables were defined within Modelica functions (as you pointed out, Mr. Tiller), perhaps shrinking them helped to improve the performance. I had assumed that all that data just got spread out in a memory array, without going through any real processing, but maybe that's not the case... time to learn more about what's going on under the hood in this environment (as always).
Thanks for the help!

Scala streaming peak detection with reactive events

I am trying to work out the best way to structure an application that in essence is a peak detection program. In my line of work I have been given charge of developing a system that essentially is looking at pulses in a stream of data and doing calculations on the peak data.
At the moment the software is implemented in LabVIEW. I'm sure many of you on here would understand why I'd love to see the end of that environment. I would like to redesign this in Scala (and possibly use Play if I was to make it use a web frontend) but I am not sure how best to approach the initial peak-detection component.
I've seen many tutorials for peak detection in various languages and I understand from a theoretical perspective many of the algorithms. What I am not sure is how would I approach this from the most Scala/Play idiomatic way?
Obviously I don't expect someone to write the code for me but I would really appreciate any pointers as to the direction I should take that makes the most sense. Since I cannot be too specific on the use case I'll try to give an overview of what I'm trying to do below:
Interfacing with data acquisition hardware to send out control voltages and read back "streams" of data.
I should be able to work the hardware side out, but is there a specific structure that would be best for the returned stream? I don't necessarily know ahead of time how much data I'll be reading so a stream that can be buffered and chunked would probably be appropriate.
Scan through the stream to find peaks and measure their height and trigger an event.
Peaks are usually about 20 samples wide or so but that depends on sample rate so I don't want to hard-code anything like that. I assume a sliding window would be necessary so peaks don't get "cut off" on the edge of a buffer. As a peak arrives I need to record and act on it. I think reactive streams and so on may be appropriate but I'm not sure. I will be making live graphs etc with the data so however it is done I need a way to send an event immediately on a successful detection.
The streams can be quite long and are at high sample rates (a minimum of 250 ksamples per second), so I'd prefer not to have to buffer the entire stream in memory. The only information that needs to be permanent is the peak voltage data. I will need a way to visualise the raw stream for calibration purposes, but I imagine that should be pretty simple.
The full application is much more complex and I'll need to do some initial filtering of noise and drift but I believe I should be able to work that out once I know what kind of implementation I should build on.
I've tried to look into Play's Iteratees and such but they are a little hard to follow. If they are an appropriate fit then I'm happy to work on learning them but since I'm not sure if that is the best way to approach the problem I'd love to know where I should look.
Reactive frameworks and the like certainly look interesting, and I can see how I could easily build the rest of the application around them, but I'm just not sure how best to implement a streaming peak-detection function on top of them beyond something simple like triggering when a value is over a threshold (as mentioned previously, a "peak" can be quite wide and the signal is noisy).
Any advice would be greatly appreciated!
This is not a solution to this question but I'm writing this as an answer because of space/formatting limitations in the comments section.
Since you are exploring options I would suggest the following:
Assuming you have a large enough buffer to keep a window of data in memory (W = t × w), you can calculate the peak for the buffer using your existing algorithm. Next you can collect the next few samples in a delta buffer (d), a much smaller window. The delta buffer is the size of your increment. Assuming this is time-series data, you can easily create the new sliding window by removing the first delta (d × t) values from the buffer W and adding the d new values to it. This is how Spark Streaming implements the reduceByWindow function on a DStream. An Iteratee can also help here.
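For concreteness, here is a minimal sketch of that delta-buffer sliding window, written in Python purely to keep it short (the same structure maps onto Scala collections, Akka Streams or Spark's reduceByWindow); the function and parameter names are hypothetical, not part of any library:

from collections import deque

def sliding_peaks(samples, window_size, delta_size, threshold):
    # window_size plays the role of W, delta_size the role of d.
    # The deque drops the oldest samples automatically as new ones arrive,
    # which is exactly the "remove the first d values, add d values" step.
    window = deque(maxlen=window_size)
    delta = []
    peaks = []
    for s in samples:
        delta.append(s)
        if len(delta) == delta_size:
            window.extend(delta)        # slide the window forward by d samples
            delta = []
            if len(window) == window_size:
                peak = max(window)      # your peak algorithm goes here
                if peak > threshold:
                    peaks.append(peak)  # "trigger an event" for the detected peak
    return peaks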
If your system is distributed then you can use stream processing systems (Storm, Spark-streaming) to get better latency and throughput at the cost of distributing the system.
If you are really resource constrained and can live with approximate results that are bounded, I would suggest you look at implementing a combination of probabilistic data structures such as count-min sketch, HyperLogLog and Bloom filters.

Virtual Memory Page Replacement Algorithms

I have a project where I am asked to develop an application to simulate how different page replacement algorithms perform (with varying working set size and stability period). My results:
Vertical axis: page faults
Horizontal axis: working set size
Depth axis: stable period
Are my results reasonable? I expected LRU to have better results than FIFO. Here, they are approximately the same.
For random, stability period and working set size don't seem to affect the performance at all? I expected graphs similar to FIFO & LRU, just with worse performance. If the reference string is highly stable (few branches) and has a small working set size, it should still have fewer page faults than an application with many branches and a big working set size?
More Info
My Python Code | The Project Question
Length of reference string (RS): 200,000
Size of virtual memory (P): 1000
Size of main memory (F): 100
Number of times a page is referenced (m): 100
Size of working set (e): 2 - 100
Stability (t): 0 - 1
Working set size (e) & stable period (t) affect how reference strings are generated.
|-----------|--------|------------------------------------|
0           p        p+e                                P-1
So assume the above is the virtual memory of size P. To generate reference strings, the following algorithm is used:
Repeat until the reference string is generated:
    pick m numbers in [p, p+e]   (m is the number of times a page is referenced)
    pick a random number r, 0 <= r < 1
    if r < t:
        generate a new p
    else:
        p = (p + 1) % P
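For reference, a minimal runnable Python sketch of this generation scheme (illustrative only; parameter names follow the list above, and this is not the linked code):

import random

def generate_reference_string(length, P, e, m, t):
    # P = virtual memory size, e = working set size,
    # m = references per locality, t = stability parameter.
    rs = []
    p = random.randint(0, P - 1)
    while len(rs) < length:
        # pick m page numbers from the current working set [p, p+e] (wrapped into [0, P-1])
        rs.extend(random.randint(p, p + e) % P for _ in range(m))
        r = random.random()
        if r < t:
            p = random.randint(0, P - 1)   # jump to a new locality
        else:
            p = (p + 1) % P                # drift forward by one page
    return rs[:length]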
UPDATE (In response to #MrGomez's answer)
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
I am using random, but it is not totally random either; references are generated with some locality through the use of the working set size and number of page references parameters.
I tried increasing numPageReferenced relative to numFrames in the hope that it would reference pages currently in memory more often, thus showing the performance benefit of LRU over FIFO, but that didn't give me a clear result. Just FYI, I tried the same app with the following parameters (the pages/frames ratio is kept the same; I reduced the size of the data to make things faster).
--numReferences 1000 --numPages 100 --numFrames 10 --numPageReferenced 20
The result is still not such a big difference. Am I right to say that if I increase numPageReferenced relative to numFrames, LRU should have better performance, since it is referencing pages in memory more often? Or perhaps I am misunderstanding something?
For random, I am thinking along the lines of:
Suppose there's high stability and a small working set. It means that the pages referenced are very likely to be in memory, so the need for the page replacement algorithm to run is lower?
Hmm maybe I got to think about this more :)
UPDATE: Thrashing less obvious with lower stability
Here, I am trying to show the thrashing as the working set size exceeds the number of frames (100) in memory. However, notice that thrashing appears less obvious with lower stability (high t); why might that be? Is the explanation that as stability becomes low, page faults approach their maximum, so it does not matter as much what the working set size is?
These results are reasonable given your current implementation. The rationale behind that, however, bears some discussion.
When considering algorithms in general, it's most important to consider the properties of the algorithms currently under inspection. Specifically, note their corner cases and best and worst case conditions. You're probably already familiar with this terse method of evaluation, so this is mostly for the benefit of those reading here who may not have an algorithmic background.
Let's break your question down by algorithm and explore their component properties in context:
FIFO shows an increase in page faults as the size of your working set (length axis) increases.
This is correct behavior, consistent with Bélády's anomaly for FIFO replacement. As the size of your working page set increases, the number of page faults should also increase.
FIFO shows an increase in page faults as system stability (1 - depth axis) decreases.
Noting your algorithm for seeding stability (if random.random() < stability), your results become less stable as stability (S) approaches 1. As you sharply increase the entropy in your data, the number of page faults, too, sharply increases and propagates Bélády's anomaly.
So far, so good.
LRU shows consistency with FIFO. Why?
Note your seeding algorithm. Standard LRU is most optimal when you have paging requests that are structured to smaller operational frames. For ordered, predictable lookups, it improves upon FIFO by aging off results that no longer exist in the current execution frame, which is a very useful property for staged execution and encapsulated, modal operation. Again, so far, so good.
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
As a result, your LRU is perceiving each element to occur a small number of times, then to be completely discarded when the next value is calculated. It thus correctly pages each value as it falls out of the window, giving you performance exactly comparable to FIFO. If your system properly accounted for recurrence or a compressed character space, you would see markedly different results.
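To make the recurrence point concrete, a quick illustrative check (the sample sizes here are arbitrary assumptions) comparing a uniform floating-point draw with a compressed integer page space:

import random

random.seed(0)
floats = [random.random() for _ in range(200_000)]
ints = [random.randint(0, 999) for _ in range(200_000)]

# Uniform floats essentially never repeat, so a cache keyed on them sees every
# value once and LRU degenerates to FIFO-like behaviour; a small integer
# (page-number) space repeats constantly, which is where LRU can pull ahead.
print(len(set(floats)), len(set(ints)))   # ~200000 unique vs. at most 1000 unique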
For random, stability period and working set size doesn't seem to affect the performance at all. Why are we seeing this scribble all over the graph instead of giving us a relatively smooth manifold?
In the case of a random paging scheme, you age off each entry stochastically. Purportedly, this should give us some form of a manifold bound to the entropy and size of our working set... right?
Or should it? For each set of entries, you randomly assign a subset to page out as a function of time. This should give relatively even paging performance, regardless of stability and regardless of your working set, as long as your access profile is again uniformly random.
So, based on the conditions you are checking, this is entirely correct behavior consistent with what we'd expect. You get an even paging performance that doesn't degrade with other factors (but, conversely, isn't improved by them) that's suitable for high load, efficient operation. Not bad, just not what you might intuitively expect.
So, in a nutshell, that's the breakdown as your project is currently implemented.
As an exercise in further exploring the properties of these algorithms in the context of different dispositions and distributions of input data, I highly recommend digging into scipy.stats to see what, for example, a Gaussian or logistic distribution might do to each graph. Then, I would come back to the documented expectations of each algorithm and draft cases where each is uniquely most and least appropriate.
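As a starting point for that exercise, here is one possible (purely illustrative) way to draw reference strings from non-uniform distributions with scipy.stats; the location and scale parameters below are arbitrary assumptions:

import numpy as np
from scipy import stats

P = 1000          # virtual memory size, as in the question
n = 200_000       # length of the reference string

# Uniform references, roughly what a single locality of the current generator produces
uniform_refs = np.random.randint(0, P, size=n)

# Gaussian-distributed references centred on a "hot" region of memory:
# a few pages are referenced very often, which is where LRU should shine.
gauss_refs = np.clip(stats.norm(loc=P / 2, scale=P / 20).rvs(size=n).astype(int), 0, P - 1)

# Logistic distribution as a heavier-tailed alternative to compare against.
logistic_refs = np.clip(stats.logistic(loc=P / 2, scale=P / 20).rvs(size=n).astype(int), 0, P - 1)

# Feed e.g. gauss_refs into the simulator in place of the uniform reference string.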
All in all, I think your teacher will be proud. :)

What is the meaning of "drop" and "sgd" while training custom ner model using spacy?

I am training a custom ner model to identify organization name in addresses.
My training loop looks like this:
import random
from spacy.util import minibatch, compounding

for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch size grows from 15 towards 32 by a factor of 1.001 per batch
    batches = minibatch(TRAIN_DATA, size=compounding(15., 32., 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=0.25, losses=losses)
    print('Losses', losses)
Can someone explain the parameters "drop", "sgd" and "size", and give some ideas on how I should change these values so that my model performs better?
You can find details and tips in the spaCy documentation:
https://spacy.io/usage/training#tips-batch-size:
The trick of increasing the batch size is starting to become quite popular ... In training the various spaCy models, we haven’t found much advantage from decaying the learning rate – but starting with a low batch size has definitely helped
batch_size = compounding(1, max_batch_size, 1.001)
This will set the batch size to start at 1, and increase each batch until it reaches a maximum size.
https://spacy.io/usage/training#tips-dropout:
For small datasets, it’s useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:
dropout = decaying(0.6, 0.2, 1e-4)
https://spacy.io/usage/training#annotations:
sgd: An optimizer, i.e. a callable to update the model’s weights. If not set, spaCy will create a new one and save it for further use.
The drop, sgd and size are some of the parameters you can customize to optimize your training.
drop is used to change the dropout rate.
size is used to change the batch size.
sgd is used to change various hyperparameters such as the learning rate, the Adam beta1 and beta2 parameters, gradient clipping and L2 regularisation.
I consider the sgd to be a very important argument to experiment with.
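Putting the pieces together, here is a hedged sketch (spaCy 2.x style, matching the loop in the question) of how size, drop and sgd typically enter the training loop; TRAIN_DATA is assumed to be your data in the usual (text, {'entities': [(start, end, label)]}) format, and the numeric values are just starting points to experiment with, not recommendations:

import random
import spacy
from spacy.util import minibatch, compounding, decaying

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities', []):
        ner.add_label(ent[2])

optimizer = nlp.begin_training()      # this optimizer is what gets passed as sgd=
dropout = decaying(0.6, 0.2, 1e-4)    # decaying dropout, as the docs suggest

for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch size compounds from 1 up to 32, as in the batch-size tip above
    batches = minibatch(TRAIN_DATA, size=compounding(1., 32., 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=next(dropout), losses=losses)
    print('Losses', losses)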
To help you, I wrote a short blog post showing how to customize any spaCy parameters from your Python interpreter (e.g. a Jupyter notebook). No command-line interface required.