Comparing Rete and sequential algorithmes - rule-engine

Which (Rete or sequential) algorithm is more efficient when rules data is known upfront and not changing?

If your data is not modified and there is no chaining or inference then Rete is not necessary - Rete is better for systems where data changes incrementally over time.
There are other algorithms like Phreak that support inference and sequential modes, that offers a reasonable compromise between the two.
http://blog.athico.com/2013/11/rip-rete-time-to-get-phreaky.html
The biggest difference would probably come from code generated rules vs a general engine processing a data structure to represent rules. It's easier to code generate for sequential execution. Although again you could probably code generate for the Phreak algorithm's sequential mode - which would close the gap.

Related

Does functional programming reduce the Von Neumann bottleneck?

I believe (from doing some reading) that reading/writing data across the bus from CPU caches to main memory places a considerable constraint on how fast a computational task (which needs to move data across the bus) can complete - the Von Neumann bottleneck.
I have come across a few articles so far which mention that functional programming can be more performant than other paradigms like the imperative approach eg. OO (in certain models of computation).
Can someone please explain some of the ways that purely functional programming can reduce this bottleneck? ie. are any of the following points found (in general) to be true?
Using immutable data structures means generally less data is moving across that bus - less writes?
Using immutable data structures means that data is possibly more likely to be hanging around in CPU cache - because less updates to existing state means less flushing of objects from cache?
Is it possible that using immutable data structures means that we may often never even read the data back from main memory because we may create the object during computation and have it in local cache and then during same time slice create a new immutable object off of it (if there is a need for an update) and we then never use original object ie. we are working a lot more with objects that are sitting in local cache.
Oh man, that’s a classic. John Backus’ 1977 ACM Turing Award lecture is all about that: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs.” (The paper, “Lambda: The Ultimate Goto,” was presented at the same conference.)
I’m guessing that either you or whoever raised this question had that lecture in mind. What Backus called “the von Neumann bottleneck” was “a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store).”
CPUs do still have a data bus, although in modern computers, it’s usually wide enough to hold a vector of words. Nor have we gotten away from the problem that we need to store and look up a lot of addresses, such as the links to daughter nodes of lists and trees.
But Backus was not just talking about physical architecture (emphasis added):
Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.
In that sense, functional programming has been largely successful at getting people to write higher-level functions, such as maps and reductions, rather than “word-at-a-time thinking” such as the for loops of C. If you try to perform an operation on a lot of data in C, today, then just like in 1977, you need to write it as a sequential loop. Potentially, each iteration of the loop could do anything to any element of the array, or any other program state, or even muck around with the loop variable itself, and any pointer could potentially alias any of these variables. At the time, that was true of the DO loops of Backus’ first high-level language, Fortran, as well, except maybe the part about pointer aliasing. To get good performance today, you try to help the compiler figure out that, no, the loop doesn’t really need to run in the order you literally specified: this is an operation it can parallelize, like a reduction or a transformation of some other array or a pure function of the loop index alone.
But that’s no longer a good fit for the physical architecture of modern computers, which are all vectorized symmetric multiprocessors—like the Cray supercomputers of the late ’70s, but faster.
Indeed, the C++ Standard Template Library now has algorithms on containers that are totally independent of the implementation details or the internal representation of the data, and Backus’ own creation, Fortran, added FORALL and PURE in 1995.
When you look at today’s big data problems, you see that the tools we use to solve them resemble functional idioms a lot more than the imperative languages Backus designed in the ’50s and ’60s. You wouldn’t write a bunch of for loops to do machine learning in 2018; you’d define a model in something like Tensorflow and evaluate it. If you want to work with big data with a lot of processors at once, it’s extremely helpful to know that your operations are associative, and therefore can be grouped in any order and then combined, allowing for automatic parallelization and vectorization. Or that a data structure can be lock-free and wait-free because it is immutable. Or that a transformation on a vector is a map that can be implemented with SIMD instructions on another vector.
Examples
Last year, I wrote a couple short programs in several different languages to solve a problem that involved finding the coefficients that minimized a cubic polynomial. A brute-force approach in C11 looked, in relevant part, like this:
static variable_t ys[MOST_COEFFS];
// #pragma omp simd safelen(MOST_COEFFS)
for ( size_t j = 0; j < n; ++j )
ys[j] = ((a3s[j]*t + a2s[j])*t + a1s[j])*t + a0s[j];
variable_t result = ys[0];
// #pragma omp simd reduction(min:y)
for ( size_t j = 1; j < n; ++j ) {
const variable_t y = ys[j];
if (y < result)
result = y;
} // end for j
The corresponding section of the C++14 version looked like this:
const variable_t result =
(((a3s*t + a2s)*t + a1s)*t + a0s).min();
In this case, the coefficient vectors were std::valarray objects, a special type of object in the STL that have restrictions on how their components can be aliased, and whose member operations are limited, and a lot of the restrictions on what operations are safe to vectorize sound a lot like the restrictions on pure functions. The list of allowed reductions, like .min() at the end, is, not coincidentally, similar to the instances of Data.Semigroup. You’ll see a similar story these days if you look through <algorithm> in the STL.
Now, I’m not going to claim that C++ has become a functional language. As it happened, I made all the objects in the program immutable and automatically collected by RIIA, but that’s just because I’ve had a lot of exposure to functional programming and that’s how I like to code now. The language itself doesn’t impose such things as immutability, garbage collection or absence of side-effects. But when we look at what Backus in 1977 said was the real von Neumann bottleneck, “an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand,” does that apply to the C++ version? The operations are linear algebra on coefficient vectors, not word-at-a-time. And the ideas C++ borrowed to do this—and the ideas behind expression templates even more so—are largely functional concepts. (Compare that snippet to how it would’ve looked in K&R C, and how Backus defined a functional program to compute inner product in section 5.2 of his Turing Award lecture in 1977.)
I also wrote a version in Haskell, but I don’t think it’s as good an example of escaping that kind of von Neumann bottleneck.
It’s absolutely possible to write functional code that meets all of Backus’ descriptions of the von Neumann bottleneck. Looking back on the code I wrote this week, I’ve done it myself. A fold or traversal that builds a list? They’re high-level abstractions, but they’re also defined as sequences of word-at-a-time operations, and half or more of the data passed through the bottleneck when you create and traverse a singly-linked list is the addresses of other data! They’re efficient ways to put data through the von Neumann bottleneck, and that’s basically why I did it: they’re great patterns for programming von Neumann machines.
If we’re interested in coding a different way, however, functional programming gives us tools to do so. (I’m not going to claim it’s the only thing that does.) Express a reduction as a foldMap, apply it to the right kind of vector, and the associativity of the monoidal operation lets you split up the problem into chunks of whatever size you want and combine the pieces later. Make an operation a map rather than a fold, on a data structure other than a singly-linked list, and it can be automatically parallelized or vectorized. Or transformed in other ways that produce the same result, since we’ve expressed the result at a higher level of abstraction, not a particular sequence of word-at-a-time operations.
My examples so far have been about parallel programming, but I’m sure quantum computing will shake up what programs look like a lot more fundamentally.

How to employ Learning-to-rank models (CNN, LSTM) in short-pair ranking?

In common applied learn-to-rank tasks, the inputs are usually semantic and have good syntactic structure, like Question-Answer ranking tasks. In this scenario, CNN or LSTM is a good structure to capture the latent information (local or long dependency) of QA-pairs.
But in reality, sometimes we just have short pair and discrete words. In this occasion, CNN or LSTM is still a fair choice?Or is there some more appropriate method can handle this?
The bigger question is how much training data you have. There's a lot of interesting work, but the reason that the deep neural network approaches tend to use QA ranking tasks is because those tasks typically have hundreds of thousands or millions of training examples.
When you have shorter queries, i.e. title or web queries, you will possibly need even more data to learn, because less of the network will be exercised by each training instance. It is possible, but the method you choose should be based on the training data you have available, rather than the size of your queries, in general.
[0-50 queries] -> Hand-tuned, time-tested model such as Query Likelihood, BM25, (or if you want better results, ngram models such as SDM) (if you want more recall, pseudo-relevance-feedback models such as RM3).
[50-1000 queries] -> Linear or Tree-based learning-to-rank methods
[1000-millions] -> Deep approach, or possibly still learning-to-rank. I'm not sure any of the deep papers have truly dominated a state-of-the-art gradient-boosted-regression-tree setup.
A recent paper by one of my labmates used pseudo-labels from BM25 to bootstrap a DNN. They got good results (better than BM25), but they literally had to be Google (training-time-wise) to pull it off.

Factors that Impact Translation Time

I have run across issues in developing models where the translation time (simulates quickly but takes far too long to translate) has become a serious issue and could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs a vectorized method - a basic model testing this didn't seem to impact anything
using input variables vs parameters
impact of annotations (e.g., Evaluate=true)
or tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
use of many connect() - this seems to be a factor (perhaps primary) as it forces translater to do all the heavy lifting
Any insight is greatly appreciated.
Clearly the answer to this question if naturally open ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference) the use of simple models and then using connect equations to link them in the appropriate order is not the best way to produce the models. Experience has shown that this method significantly increases the translation time to unbearable lengths. It is better to create distributed models in the same approach that is used the MSL Dynamic pipe (not exactly like it but similar).
Changing the approach as described is significantly faster in translational time (orders of magnitude for larger models, >~100,000 equations) than using connect statements as the number of distributed elements increases to larger numbers. This was tested using Dymola 2017 and 2017FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

Scala : How to speed up parallel synchronous processing on a small list?

I see that Parallel collections are mostly designed to speed up processing of large collections but it is not mentioned if this can be helpful for small lists or not.
Check this example:
List(1,2,3).map(loadHeavyFile(_))
List(1,2,3).par.map(loadHeavyFile(_)).toList
What I want to know here is that :
Is one faster than another?
Will Scala use multiple threads if the list only has 3 elements?
In a general way, is it possible to speed up the response time of this code?
I know I can use Future, then Future.sequence and wait for the result to come up, but this seems unnatural when loadHeavyFile is a synchronous call. I don't want to have to specify a timeout for example.
Note: I want to preserve list ordering.
Any guidance here is appreciated.
In parallel collections Measuring Performance - How big should a collection be to go parallel? note some parameters that are useful for estimating and deciding whether a parallel version of a collection may outperform the non-parallel one.
For such a small list, the overhead of creating the parallel version may be larger than the actual sequential mapping; however in this example the focus lies on I/O performance and possibly on memory management, should the involved files be large and be uploaded to memory.
To provide a more factual answer, consider benchmarking the sequential and parallel versions for different I/O loads in order to estimate roughly which heavy I/O load threshold makes the parallel version worth it. The I/O factor makes the estimation OS and hardware dependent and it may prove it hard to generalise.

Which language should I prefer working with if I want to use the Fast Artificial Neural Network Library (FANN)?

I am working on reducing dimentionality of a set of (Boolean) vectors with both the number and dimentionality of vectors tending to be of the order of 10^5-10^6 using autoencoders. Hence even though speed is not of essence (it is supposed to be a pre-computation for a clustering algorithm) but obviously one would expect that the computations take a reasonable amount of time. Seeing how the library itself was written in c++ would it be a good idea to stick to it or to code in Java (Since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will be your code? If the hard part is done by the library and your code is only to generate the input and post-process the output, Java would be a valid choice. Compare it to Matlab: The language is very slow but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++. Consider learning C++ takes a lot of time. If you have only a small scaled project, it could be easier to buy a bigger machine or wait two days instead of one, to get the results.
Have you legacy code in one of the languages you want to couple or maybe re-use?
Overall, I would advice you to set up a benchmark example in whatever language you like more. Then give it a try. If the speed is ok, stick to it. If you wait to long, think about alternatives (new hardware, parallel execution, different language).