What is the smallest unit of work that is sensible to parallelize with actors? - scala

While Scala actors are described as light-weight, Akka actors even more so, there is obviously some overhead to using them.
So my question is, what is the smallest unit of work that is worth parallelising with Actors (assuming it can be parallelized)? Is it only worth it if there is some potential latency or there are a lot of heavy calculations?
I'm looking for a general rule of thumb that I can easily apply in my everyday work.
EDIT: The answers so far have made me realise that what I'm interested in is perhaps actually the inverse of the question that I originally asked. So:
Assuming that structuring my program with actors is a very good fit, and therefore incurs no extra development overhead (or even incurs less development overhead than a non-actor implementation would), but the units of work it performs are quite small - is there a point at which using actors would be damaging in terms of performance and should be avoided?

Whether to use actors is not primarily a question of the unit of work; the main benefit of actors is that they make concurrent programs easier to get right. In exchange for this, you need to model your solution according to a different paradigm.
So, you need to decide first whether to use concurrency at all (which may be due to performance or correctness) and then whether to use actors. The latter is very much a matter of taste, although with Akka 2.0 I would need good reasons not to, since you get distributability (up & out) essentially for free with very little overhead.
If you still want to decide the other way around, a rule of thumb from our performance tests might be that the target message processing rate should not be higher than a few million per second.

My rule of thumb, for everyday work, is that if it takes milliseconds then it's potentially worth parallelizing. Although the achievable transaction rates are higher than that (the overhead is usually no more than a few tens of microseconds), I like to stay well away from overhead-dominated cases. Of course, it may need to take much longer than a few milliseconds to actually be worth parallelizing. You always have to balance the time taken by writing more code against the time saved running it.
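One way to stay out of the overhead-dominated range is to group small work units so that each scheduled task carries a meaningful amount of work. The following is only a sketch of that idea using plain scala.concurrent.Future rather than actors; the names Batching, mapInChunks and chunkSize are invented for this example, and a suitable chunk size would have to be found by measurement.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Batching {
  // Hypothetical helper: apply f to every item, but schedule one task per
  // chunk of items instead of one task per item, so that the per-task
  // overhead is amortized over up to chunkSize units of work.
  def mapInChunks[A, B](items: Vector[A], chunkSize: Int)(f: A => B)(
      implicit ec: ExecutionContext): Future[Vector[B]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    val chunks = items.grouped(chunkSize).toVector
    // Each chunk is processed sequentially inside a single Future;
    // traverse keeps the chunks in their original order.
    Future.traverse(chunks)(chunk => Future(chunk.map(f))).map(_.flatten)
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val squares = mapInChunks((1 to 10).toVector, chunkSize = 4)(x => x * x)
    println(Await.result(squares, 10.seconds))
  }
}
```

The same grouping could be applied to messages sent to an actor: one message per chunk rather than one per item.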

If the work units have no side effects, it is better to decide how to split the work at run time:
protected T compute() {
    if (r - l <= T1 || getSurplusQueuedTaskCount() >= T2)
        return problem.solve(l, r);
    // decompose
}
Where:
T1 = N / (L * Runtime.getRuntime().availableProcessors())
N - size of the work, in units
L = 8..16 - load factor, configured manually
T2 = 1..3 - maximum length of the work queue after all steals
Here is a presentation with many more details and figures:
http://shipilev.net/pub/talks/jeeconf-May2012-forkjoin.pdf
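To make the compute() outline above concrete, here is a minimal Scala sketch on top of java.util.concurrent. It is not taken from the presentation: summing an array stands in for problem.solve, the decomposition is a plain split in half, and the names ForkJoinSum, SumTask, LoadFactor and MaxSurplus are assumptions made for the example.

```scala
import java.util.concurrent.{ForkJoinPool, ForkJoinTask, RecursiveTask}

object ForkJoinSum {
  // L and T2, picked from the ranges quoted above (assumed values).
  val LoadFactor = 8
  val MaxSurplus = 3

  // Sums data(l until r); the summation is only a stand-in for problem.solve.
  final class SumTask(data: Array[Long], l: Int, r: Int, t1: Int)
      extends RecursiveTask[Long] {
    // Made public here so that one task can run another directly.
    override def compute(): Long = {
      if (r - l <= t1 || ForkJoinTask.getSurplusQueuedTaskCount() >= MaxSurplus) {
        // Small enough, or enough tasks already queued: solve sequentially.
        var sum = 0L
        var i = l
        while (i < r) { sum += data(i); i += 1 }
        sum
      } else {
        // Decompose: fork the left half, compute the right half in place.
        val mid = l + (r - l) / 2
        val left = new SumTask(data, l, mid, t1)
        left.fork()
        val right = new SumTask(data, mid, r, t1).compute()
        left.join() + right
      }
    }
  }

  def parallelSum(data: Array[Long]): Long = {
    val processors = Runtime.getRuntime.availableProcessors()
    // T1 = N / (L * processors), but at least 1 so the recursion terminates.
    val t1 = math.max(1, data.length / (LoadFactor * processors))
    ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length, t1))
  }
}
```

Calling ForkJoinSum.parallelSum(Array.tabulate(100000)(_.toLong)) adds the numbers 0 to 99999 using the common pool.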

Related

Can I use ScalaMeter with no input?

I want to benchmark the runtime of several methods in my Scala application, and I am looking into using ScalaMeter. Let's say I want to measure the time of a method called doSomething().
I only want to call doSomething and measure the time it takes to run once. However all of the documentation I see for ScalaMeter requires providing some kind of input, whether it is a series of integers, a string, or something.
Is it possible to use ScalaMeter to do what I am asking? Is it an appropriate use case?
It is possible, but it would be a waste of time.
As you're probably aware, ScalaMeter is designed to remove the effects of variation in function execution times, so that it's possible to accurately benchmark those execution times. For example, you might want to verify that a function completes within a required time, or to determine whether its performance is maintained over time as changes are made to the code base.
Why is that so challenging? Well, there are a number of obstacles to overcome:
The JVM has a number of different options for executing the Java bytecode in a program. Some (such as the Zero VM) just interpret code; others use just-in-time (JIT) compilation to translate it into optimized machine code for the host CPU; the HotSpot Server VM optimizes aggressively over time, so that code performance improves incrementally the longer it runs. For benchmarking purposes, the HotSpot Client VM performs very good optimization and reaches a steady state quickly, which allows us to start measuring performance rapidly. However, we still need to allow the JIT compiler to warm up, and so we must disregard the first few, slower executions (runs) that would otherwise bias our results. ScalaMeter does a pretty good job of undertaking this warmup by itself, but the number of runs to be discarded is configurable.
The JVM performs a number of garbage collection (GC) cycles, seemingly at random, which can similarly slow down performance when they occur. ScalaMeter can be configured to ignore executions in which GC cycles occurred.
The host machine's load can vary as it executes threads from other processes running on the same machine. These also potentially slow down execution times. ScalaMeter deals with this by considering only the fastest observed time in a fixed number of runs, rather than by taking an average.
If you're running from SBT, a forked JVM execution session will perform better, and with less variation, than one that shares the same JVM instance as SBT (because more of the SBT JVM's resources will be in use).
Virtual memory page faults (in which the memory making up the application's working set is switched to/from a paging file) will also randomly impact performance.
The performance of many functions will depend upon their arguments (and, if you're not into functional programming, shared mutable state). Tying performance to argument values is also something ScalaMeter is good at, through its use of generators. (For example, consider a size operation on a List: it will clearly take longer to execute as the number of elements in the List increases.)
Etc. You can find more on these issues in the ScalaMeter Getting Started Introduction.
Clearly, benchmarks should be performed on the same host machine so that the results are comparable, since CPU, OS, Memory, BIOS Config, etc. all affect performance too.
So, having explained all that, you will understand why ScalaMeter needs to execute the same function a lot! ;-)
In your case, doSomething() takes no arguments, so you can use a Gen.single generator that identifies the class or object to which doSomething() belongs.
Note: This is written as a ScalaMeter test, and so the source should be under src/test/scala. It will look something like the following:
import org.scalameter.api._
import org.scalameter.picklers.Implicits._
object MyBenchmark extends Bench.ForkedTime {
  // We have no arguments. Instead, create a single "generator" that identifies the class or
  // object that doSomething belongs to. This assumes doSomething() belongs to object
  // MyObject.
  val owner = Gen.single("owner")(MyObject)
  // Measure MyObject.doSomething()'s performance.
  performance of "MyObject" in {
    measure method "doSomething()" in {
      using(owner) in {
        _.doSomething()
      }
    }
  }
}
(BTW: I would have thought that benchmarking functions with no arguments would be more straightforward than this, but this is the best I've been able to come up with so far. If anyone has a better idea, please add a comment and let me know!)
So, if all of that is overkill, you might want to try something like this:
// Measure the nanoseconds taken to execute the by-name argument.
def measureTime(x: => Unit): Long = {
  val start = System.nanoTime()
  x
  // Calculate how long that took and return the value.
  System.nanoTime() - start
}
measureTime {
  doSomething()
}
This executes the function only once, so the time taken is likely to be wildly different each time you run it.
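A middle ground between the two is to borrow two of the ideas described above, discarding warm-up runs and keeping the fastest of several measured runs, while still avoiding a benchmarking framework. The sketch below is a hand-rolled illustration, not part of ScalaMeter; RoughTimer, fastestOf, warmups and runs are names chosen for this example, and it does nothing about GC cycles, JVM forking or machine load.

```scala
object RoughTimer {
  // Hypothetical helper: run `body` a few times to warm up, then time it
  // `runs` more times and return the fastest run, in nanoseconds.
  def fastestOf(warmups: Int, runs: Int)(body: => Unit): Long = {
    require(runs > 0, "need at least one measured run")
    var i = 0
    while (i < warmups) { body; i += 1 } // warm-up timings are discarded
    var best = Long.MaxValue
    i = 0
    while (i < runs) {
      val start = System.nanoTime()
      body
      val elapsed = System.nanoTime() - start
      if (elapsed < best) best = elapsed
      i += 1
    }
    best
  }
}
```

It would be used as RoughTimer.fastestOf(warmups = 1000, runs = 100) { doSomething() }, which is only appropriate if doSomething() can safely be called many times.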

Bipartite graph distributed processing with dynamic programming

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
A brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting: if I have f() && g() && h() and f() evaluates to false, then I do not have to process g() and h() at all and can return false immediately
the number of Fields in a single Document is always the same (about 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using sharing (the second assumption), I had the idea to first evaluate the Document against every Rule and then, once that has finished, compute the result for every Filter from the already computed Rules. This way, a shared Rule is computed only once. However, this doesn't take advantage of early cutting (the third assumption).
The second idea is to use early cutting as a slightly optimized brute force, but then it won't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
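Within a single process the two ideas are not mutually exclusive: if Rule results are memoized lazily, a Rule is evaluated only when some Filter actually reaches it, and at most once per Document. The following sketch illustrates this under assumed types; FilterEval, Rule, Filter and matching are names invented here, and a Rule is modelled simply as an identified predicate.

```scala
import scala.collection.mutable

object FilterEval {
  // Assumed shapes for this sketch: a Rule is an identified predicate over a
  // Document, and a Filter is a conjunction of Rules.
  type Document = Map[String, String]
  final case class Rule(id: String, test: Document => Boolean)
  final case class Filter(id: String, rules: List[Rule])

  // Returns the ids of the matching Filters and the number of Rule
  // evaluations that were actually performed for this Document.
  def matching(doc: Document, filters: List[Filter]): (Set[String], Int) = {
    val cache = mutable.Map.empty[String, Boolean]
    var evaluations = 0
    def eval(rule: Rule): Boolean =
      cache.getOrElseUpdate(rule.id, { evaluations += 1; rule.test(doc) })
    // forall stops at the first false Rule (early cutting); eval reuses the
    // result of a Rule already computed for another Filter (sharing).
    val hits = filters.filter(_.rules.forall(eval)).map(_.id).toSet
    (hits, evaluations)
  }
}
```

The cache lives only for one Document inside one process, so this does not by itself address sharing across FaaS instances.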
I have noticed that the Filter-Rule relation is a bipartite graph, though I am not sure whether that can help me. I have also noticed that I could use reverse sets and store in every Rule the corresponding set of Filters. This would, however, create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and each one of them is an event that creates a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the Shared-Nothing architecture.
Sample Filter:
{
    'normalForm': 'CONJUNCTIVE',
    'rules':
    [
        {
            'isNegated': true,
            'field': 'X',
            'relation': 'STARTS_WITH',
            'value': 'G',
        },
        {
            'isNegated': false,
            'field': 'Y',
            'relation': 'CONTAINS',
            'value': 'KEY',
        },
    ]
}
or, in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
which, for the Document:
{
    'x': 'CAR',
    'y': 'KEYBOARD',
    'z': 'PAPER',
}
evaluates to true.
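For illustration, the sample Filter can be interpreted directly from its data form. The sketch below is one possible reading, not a fixed design: the type names are invented, only the two relations from the sample are covered, field names are matched in lower case because the sample Filter says 'X' while the Document says 'x', and a missing field is treated as an empty string.

```scala
object SampleFilter {
  sealed trait Relation
  case object StartsWith extends Relation
  case object Contains extends Relation

  final case class Rule(isNegated: Boolean, field: String,
                        relation: Relation, value: String)

  type Document = Map[String, String]

  // Evaluates one Rule against a Document.
  def holds(rule: Rule, doc: Document): Boolean = {
    // Assumptions: lower-cased field lookup, missing field = empty string.
    val fieldValue = doc.getOrElse(rule.field.toLowerCase, "")
    val raw = rule.relation match {
      case StartsWith => fieldValue.startsWith(rule.value)
      case Contains   => fieldValue.contains(rule.value)
    }
    raw != rule.isNegated // flips the result when the Rule is negated
  }

  // CONJUNCTIVE normal form: every Rule must hold; forall stops at the
  // first Rule that fails.
  def matches(rules: List[Rule], doc: Document): Boolean =
    rules.forall(holds(_, doc))
}
```

With the two sample Rules and the sample Document, matches returns true, in line with the condensed form above.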
I can slightly change the data model, stream something else instead of Documents (e.g. Filters) and use any NoSQL database and tools that help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help me, but I am not sure about it.
Can it be processed efficiently (with regard to - probably - database queries)? What tools would be appropriate?
I have been also wondering, if maybe I am trying to solve special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and launches N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, so Filters are distributed evenly across instances). This way every Filter is fetched from the database only once.
Now, every instance streams all Documents from the cache (one Document should be less than 1 MB, and I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore that by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using Shared Nothing scalable architecture?
I realized that although my graph is indeed bipartite in theory, in practice it is going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
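Finding those disjoint parts amounts to computing the connected components of the Filter-Rule graph. A minimal union-find sketch of that step follows; the input representation (Filter id mapped to the ids of its Rules) and the names FilterGroups and components are assumptions for this example.

```scala
import scala.collection.mutable

object FilterGroups {
  // Input: Filter id -> ids of the Rules it uses (assumed representation).
  // Output: groups of Filter ids; Filters in different groups share no Rule,
  // neither directly nor through a chain of other Filters.
  def components(rulesOf: Map[String, Set[String]]): Set[Set[String]] = {
    // Union-find over Filter ids, with path compression.
    val parent = mutable.Map.empty[String, String]
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else { val root = find(p); parent.update(x, root); root }
    }
    def union(a: String, b: String): Unit = parent.update(find(a), find(b))

    // Remember the first Filter seen for each Rule; any later Filter using
    // the same Rule is merged into that Filter's group.
    val firstUser = mutable.Map.empty[String, String]
    for ((filter, rules) <- rulesOf) {
      find(filter) // registers Filters that share nothing
      for (rule <- rules) {
        firstUser.get(rule) match {
          case Some(other) => union(filter, other)
          case None        => firstUser.update(rule, filter)
        }
      }
    }
    rulesOf.keySet.groupBy(find).values.toSet
  }
}
```

Each resulting group could then be handed to its own FaaS instance, with Rule results memoized inside that instance.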
This reduces my problem to processing a single connected bipartite graph. Now, I can use the benefits of dynamic programming and share the results of Rule computations only if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing this benefit. So I thought of it this way: if I have already decided that I will have to recompute some Rules, then let their number be small compared to the size of the disjoint parts that I get.
This is actually the minimum cut problem, which (fortunately) has known polynomial-time algorithms.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph; I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is small enough to be processed in seconds in a FaaS instance, which has a time bound).
This means that I am looking for a cut that creates two disjoint bipartite graphs with the same (or at least a similar) number of vertices each.
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors cutting fewer edges.
Currently, this does look like solution for my problem, however I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with other data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it to other (simpler) problem, or at least that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.

eliminate FIX layer to increase performance

Does it make sense? A protocol designed for speed as well as resiliency that eliminates the FIX layer for high-performance order execution?
The FAST protocol is intended to be a "faster" version of the FIX protocol. However, the quantity of extra processing it requires means that it is only faster "on the wire", and so it will not be very effective for those with boxes at the exchange. @dumbcoder is, as usual, correct about optimization and high-powered machines being the best way of reducing latencies. It is also very important that FIX isn't inherently slow; that depends on your implementation. Sell-side and HFT implementations are much faster than the cheaper ones used by the hedgies and investors.
Are you asking whether exchanges should adopt non-fix protocols for receiving market messages? Some already have alternatives (e.g. NASDAQ's ITCH and OUCH). But they don't 'eliminate' the FIX layer - they still provide the same function, they just go about it in a different way.
FIX actually doesn't have to be all that slow - if you treat messages as byte arrays (instead of one big string) and then only get out exactly what you need (which, for order acceptances, fills, etc., can be very few tags), then FIX is really not that bad.
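As a rough illustration of the byte-array approach, the sketch below scans a raw tag=value message (fields separated by the SOH byte, 0x01) and decodes only the value of the requested tag. It is a simplification rather than a complete FIX parser: FixScan and valueOf are names invented here, the message is assumed to be well formed, and raw-data fields that may themselves contain SOH are not handled.

```scala
import java.nio.charset.StandardCharsets.US_ASCII

object FixScan {
  private val SOH: Byte = 1         // FIX field delimiter
  private val EQ: Byte = '='.toByte // separates tag from value

  // Returns the value of the first occurrence of `tag`, scanning the raw
  // bytes field by field. Only the requested value is turned into a String.
  def valueOf(msg: Array[Byte], tag: Int): Option[String] = {
    var pos = 0
    while (pos < msg.length) {
      // Parse the numeric tag up to '='.
      var t = 0
      while (pos < msg.length && msg(pos) != EQ) {
        t = t * 10 + (msg(pos) - '0')
        pos += 1
      }
      val valueStart = pos + 1
      // Find the end of this field.
      var end = valueStart
      while (end < msg.length && msg(end) != SOH) end += 1
      if (t == tag && valueStart <= msg.length)
        return Some(new String(msg, valueStart, end - valueStart, US_ASCII))
      pos = end + 1
    }
    None
  }
}
```

For an execution report, a caller would ask only for the handful of tags it needs, for example FixScan.valueOf(bytes, 39) for the order status, and leave the rest of the message undecoded.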
The key selling point of FIX is that it is an industry standard. Exchanges are free to develop their own proprietary protocols which can be higher performance, but the fact that everyone can write to a single protocol is a big deal (even if it is not always implemented in the most efficient manner).
Another angle which should be explored is whether there is a separate communication channel for the protocol or whether one is implemented as a wrapper on top of another.
Working with FIX has the definite advantage that an implementation remains portable across exchanges with only slight modification.

Streams vs. tail recursion for iterative processes

This is a follow-up to my previous question.
I understand that we can use streams to generate an approximation of 'pi' (and other numbers), the n-th Fibonacci number, etc. However, I doubt whether streams are the right approach for that.
The main drawback (as I see it) is memory consumption: e.g. a stream will retain all Fibonacci numbers for i < n, while I need only the n-th. Of course, I can use drop, but it makes the solution a bit more complicated. Tail recursion looks like a more suitable approach for tasks like that.
What do you think?
If you need to go fast, travel light. That means: avoid allocating any unnecessary memory. If you need memory, use the fastest collections available. If you know how much memory you need, preallocate. Allocation is the absolute performance killer... for calculation. Your code may not look nice anymore, but it will go fast.
However, if you're working with IO (disk, network) or any user interaction then allocation pales. It's then better to shift priority from code performance to maintainability.
Use Iterator. It does not retain intermediate values.
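As a small sketch of what that can look like for the n-th Fibonacci number, here is an Iterator-based version next to a tail-recursive one for comparison. The object and method names are chosen for this example, and the sequence is taken to start at fib(0) = 0.

```scala
import scala.annotation.tailrec

object Fib {
  // Iterator version: only the current pair (a, b) is reachable at any
  // time, so earlier values can be collected as soon as they are passed.
  def viaIterator(n: Int): BigInt =
    Iterator.iterate((BigInt(0), BigInt(1))) { case (a, b) => (b, a + b) }
      .drop(n).next()._1

  // Tail-recursive version: compiled to a loop, so the stack does not grow.
  def viaTailRec(n: Int): BigInt = {
    @tailrec def loop(i: Int, a: BigInt, b: BigInt): BigInt =
      if (i == 0) a else loop(i - 1, b, a + b)
    loop(n, 0, 1)
  }
}
```

Both take a non-negative n; Fib.viaIterator(10) and Fib.viaTailRec(10) each give 55.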
If you want n-th fibonacci number and use a stream just as a temporary data structure (if you do not hold references to previously computed elements of stream) then your algorithm would run in constant space.
Previously computed elements of a Stream (which are not used anymore) are going to be garbage collected. And as they were allocated in the youngest generation and immediately collected, almost all allocations might be in cache.
Update:
It seems that the current implementation of Stream is not as space-efficient as it could be, mainly because it inherits its implementation of the apply method from the LinearSeqOptimized trait, where it is defined as
def apply(n: Int): A = {
  val rest = drop(n)
  if (n < 0 || rest.isEmpty) throw new IndexOutOfBoundsException("" + n)
  rest.head
}
A reference to the head of the stream is held here by this, which prevents the stream from being gc'ed. So a combination of the drop and head methods (as in f.drop(100).head) may be better for situations where dropping intermediate results is feasible. (Thanks to Sebastien Bocq for explaining this stuff on scala-user.)

Cocoa Touch Programming. KVO/KVC in the inner loop is super slow. How do I speed things up?

I've become a huge fan of KVO/KVC. I love the way it keeps my MVC architecture clean. However I am not in love with the huge performance hit I incur when I use KVO within the inner rendering loop of the 3D rendering app I'm designing where messages will fire at 60 times per second for each object under observation - potentially hundreds.
What are the tips and tricks for speeding up KVO? Specifically, I am observing a scalar value - not an object - so perhaps the wrapping/unwrapping is killing me. I am also setting up and tearing down observation
[foo addObserver:bar forKeyPath:@"fooKey" options:0 context:NULL];
[foo removeObserver:bar forKeyPath:@"fooKey"];
within the inner loop. Perhaps I'm taking a hit for that.
I really, really, want to keep the huge flexibility KVO provides me. Any speed freaks out there who can lend a hand?
Cheers,
Doug
Objective-C's message dispatch and other features are tuned and pretty fast for what they provide, but they still don't approach the potential of tuned C for computational tasks:
NSNumber *a = [NSNumber numberWithInteger:(b.integerValue + c.integerValue)];
is way slower than:
NSInteger a = b + c;
and nobody actually does math on objects in Objective-C for that reason (well that and the syntax is awful).
The power of Objective-C is that you have a nice expressive message based object system where you can throw away the expensive bits and use pure C when you need to. KVO is one of the expensive bits. I love KVO, I use it all the time. It is computationally expensive, especially when you have lots of observed objects.
An inner loop is that small bit of code you run over and over; anything there will be done over and over. It is the place where you should be eliminating OOP features if need be, where you should not be allocating memory, and where you should be considering replacing method calls with static inline functions. Even if you somehow manage to get acceptable performance in your rendering loop, it will be much lower performance than if you got all that expensive notification and dispatch logic out of there.
If you really want to try to keep it going with KVO here are a few things you can try to make things go faster:
Switch from automatic to manual KVO in your objects. This may allow you to reduce spurious notifications.
Aggregate updates: If your intermediate values over some time interval are not relevant, and you can defer for some amount of time (like until the next animation frame), don't post the change; mark that the change needs to be posted and wait for the relevant timer to go off. You might get to avoid a bunch of short-lived intermediary updates. You might also use some sort of proxy to aggregate related changes between multiple objects.
Merge observable properties: If you have a large number of properties in one type of object that might change, you may be better off making a single "hasChanges" property observable and having the observer query the properties.