How to improve mailbox scan time in Scala actors - scala

I created following example of actors in Scala: http://pastebin.com/pa3WVpKy
Without the throttling (reducing the number of SendMoney messages) that happens in these lines:
val processed = iterations - counter.getCount/2
if (processed < i - banksCount * 5) Thread.sleep(1)
message processing in this test is very slow (especially when there are few bank actors).
That's because actors' mailboxes are full of SendMoney messages and receiving ReadAccountResponse messages takes a long time (they are usually almost at the end of mailbox, and whole mailbox must be scanned).
How to improve mailbox scan time in such cases?
Maybe there is a possibility to define some messages as high priority?
It would be great to have two mailboxes - one for usual messages and one for high priority ones. The high priority mailbox could be scanned first.
Also "reply" method could send messages to high priority mailbox automatically. Or maybe create two mailboxes - for usual messages and responses?
What's your opinion?
Regards
Wojciech DurczyƄski

One potentially good solution to this problem would be Philipp Haller's translucent functions, in which the Scala compiler reflectively exposes information about what kinds of objects a match expression can match. Then actor mailboxes can be indexed by message class and lookup can potentially be drastically faster, especially in this kind of "needle in a haystack" scenario.
Here's the API for TranslucentFunction; as you can see it's pretty straightforward. It seems like the Translucent project has been on hiatus for a while, let's hope it picks up again soon!
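To make the idea concrete, here is a rough sketch of what a mailbox indexed by message class could look like. This is purely hypothetical (it is not the Translucent API, and the class and method names are made up); the point is that a receive which only matches ReadAccountResponse would never have to scan the SendMoney backlog:

import scala.collection.mutable

// Hypothetical sketch only: messages are bucketed by runtime class, so
// looking for one message type skips the queues of all other types.
class ClassIndexedMailbox {
  private val buckets = mutable.Map.empty[Class[_], mutable.Queue[Any]]

  def enqueue(msg: AnyRef): Unit =
    buckets.getOrElseUpdate(msg.getClass, mutable.Queue.empty[Any]).enqueue(msg)

  // Dequeue a message of one of the wanted classes (oldest within its class), if any.
  def dequeueFirstOf(wanted: Class[_]*): Option[Any] =
    wanted.iterator
      .flatMap(c => buckets.get(c).iterator)
      .find(_.nonEmpty)
      .map(_.dequeue())
}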

I believe that Lift's actors have exactly this prioritisation built-in: rather than overriding a single "act" method, there are a number of different methods (not sure of the exact names) which can be implemented dependent on the action's priority.
I'm not sure whether this solves the scanning slowdown issue though
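For comparison, Akka (which comes up in the later questions here) has exactly this kind of message prioritisation built in via priority mailboxes. A minimal sketch, assuming Akka's UnboundedPriorityMailbox and a stand-in ReadAccountResponse type for the response message from the linked example:

import akka.actor.ActorSystem
import akka.dispatch.{PriorityGenerator, UnboundedPriorityMailbox}
import com.typesafe.config.Config

// Illustrative stand-in for the response message from the pastebin example.
case class ReadAccountResponse(balance: Long)

// Lower number = higher priority: responses are dequeued before the
// SendMoney backlog, so they no longer sit "at the end of the mailbox".
class ResponseFirstMailbox(settings: ActorSystem.Settings, config: Config)
  extends UnboundedPriorityMailbox(PriorityGenerator {
    case _: ReadAccountResponse => 0
    case _                      => 1
  })

// The mailbox is then registered in application.conf, e.g.
//   response-first-mailbox.mailbox-type = "ResponseFirstMailbox"
// and selected with Props(...).withMailbox("response-first-mailbox").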

Related

What's the max number of topics I can have on ZeroMQ?

I'm new to ZeroMQ (I've been using SQS so far).
I would like to build a system where every time a user logs in, they subscribe to a queue. Then all the users subscribed to this queue are interested only in messages directed to them.
I read about topic matching. It seems that I could create a pattern like this:
development.player.234345345
development.player.453423423
integration.player.345354664
And each worker (user) can subscribe to the queue and listen only to the topic they match, i.e. player 234345345 in the development environment will only subscribe to messages with the topic development.player.234345345
Is this true?
And if so, what are the consequences in ZeroMQ?
Is there a limit on how many topic subscriptions I can have?
ZeroMQ has a very detailed page on how the internals of topic matching work. It looks like you can have as many topics as you want, but topic matching incurs a runtime cost. It's supposed to be extremely fast:
We believe that application of the above algorithms can give a system
that will be able to match or filter a single message in the range of
nanoseconds or a couple of microseconds even in the case of a large amount
of different topics and subscriptions.
However, there are some caveats you need to be aware of:
The inverted bitmap technique thus works by pre-indexing a set of
searchable items so that a search request can be resolved with a
minimal number of operations.
It is efficient if and only if the set of searchable items is
relatively stable with respect to the number of search requests.
Otherwise the cost of re-indexing is excessive.
In short, as long as you don't change your subscriptions too often, you should be able to do on the order of thousands of topics at least.
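For completeness, a per-player subscription is just a plain prefix subscription; a minimal sketch, assuming the JeroMQ binding used from Scala and the example topics above (the endpoint address is illustrative):

import org.zeromq.{SocketType, ZContext, ZMQ}

// Sketch: each logged-in user runs a SUB socket subscribed only to its own topic.
val ctx = new ZContext()
val sub = ctx.createSocket(SocketType.SUB)
sub.connect("tcp://localhost:5556") // illustrative address
sub.subscribe("development.player.234345345".getBytes(ZMQ.CHARSET))

// Only messages whose topic starts with the subscribed prefix are delivered.
val msg = sub.recvStr()
println(msg)
ctx.close()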
A: Yes, you can
The max number? That's the harder part...
You may want to read Martin SUSTRIK's posts on this:
While ZeroMQ keeps evolving on its own, Martin, ZeroMQ's co-father, has posted a few interesting facts on this subject here, with some further details and a design-view discussion here:
Efficient Subscription Matching
In ZeroMQ, simple tries are used to store and match PUB/SUB subscriptions. The subscription mechanism was intended for up to 10,000 subscriptions where simple trie works well. However, there are users who use as much as 150,000,000 subscriptions. In such cases there's a need for a more efficient data structure.
Worth reading to have some estimate of where safe-zones are.
Also worth knowing: not all ZeroMQ versions behave the same way.
The recent API uses PUB-side topic filtering, which was not the case for earlier versions, where SUB-side filtering was used. Translate that into network transport: all messages, irrespective of their final destination, get broadcast to every SUB, only for a single one (the user, in your use case) to match, while all the rest discard the messages due to topic-filter mismatches.
So your use cases ought to take into account which ZeroMQ versions (including different non-native language bindings and wrappers) may meet and cooperate on the same playground.
Anyway, ZeroMQ is a great tool, and in recent years nanomsg has also become worth monitoring as a challenger.

Concurrency, how to create an efficient actor setup?

Alright, so I have never done intense concurrent operations like this before; there are three main parts to this algorithm.
This all starts with a Vector of around 1 Million items.
Each item gets processed in 3 main stages.
Task 1: Make an HTTP Request, Convert received data into a map of around 50 entries.
Task 2: Receive the map and do some computations to generate a class instance based off the info found in the map.
Task 3: Receive the class and generate/add to multiple output files.
I initially started out by concurrently running task 1 with 64K entries across 64 threads (1024 entries per thread), generating the threads in a for loop.
This worked well and was relatively fast, but I keep hearing about actors and how they are heaps better than basic Java threads/thread pools. I've created a few actors, etc., but don't know where to go from here.
Basically:
1. Are actors the right way to achieve fast concurrency for this specific set of tasks? Or is there another way I should go about it?
2. How do you know how many threads/actors are too many? Specifically for task one, how do you know what the limit on the number of simultaneous connections is (I'm on a Mac)? Is there a golden rule to follow? How many threads vs. how large a thread pool? And the actor equivalents?
3. Is there any code I can look at that implements actors in a similar fashion? All the code I'm seeing either gets an actor to print hello world or is super complex.
1) Actors are a good choice to design complex interactions between components since they resemble "real life" a lot. You can see them as different people sending each other requests, it is very natural to model interactions. However, they are most powerful when you want to manage changing state in your application, which does not seem to be the case for you. You can achieve fast concurrency without actors. Up to you.
2) If none of your operations is blocking, the best rule is: number of threads = number of CPUs. If you use a non-blocking HTTP client and NIO when writing your output files, then you should be fully non-blocking on I/O and can safely set the thread count for your app to the CPU count of your machine (see the sketch after this answer).
3) The documentation on http://akka.io is very very good and comprehensive. If you have no clue how to use the actor model I would recommend getting a book - not necessarily about Akka.
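A concrete sketch of point 2) above: sizing a fixed pool to the CPU count is a one-liner on the JVM. The ExecutionContext here is just one example of any pool you might configure this way:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Size the pool to the number of available CPUs, per the advice above.
val cpus = Runtime.getRuntime.availableProcessors()
implicit val cpuBoundEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(cpus))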
1) It sounds like most of your steps aren't stateful, in which case actors add complication for no real benefit. If you need to coordinate multiple tasks in a mutable way (e.g. for generating the output files) then actors are a good fit for that piece. But the HTTP fetches should probably just be calls to some nonblocking HTTP library (e.g. spray-client - which will in fact use actors "under the hood", but in a way that doesn't expose the statefulness to you).
2) With blocking threads you pretty much have to experiment and see how many you can run without consuming too many resources. Worry about how many simultaneous connections the remote system can handle rather than hitting any "connection limits" on your own machine (it's possible you'll hit the file descriptor limit but if so best practice is just to increase it). Once you figure that out, there's no value in having more threads than the number of simultaneous connections you want to make.
As others have said, with nonblocking everything you should probably just have a number of threads similar to the number of CPU cores (I've also heard "2x number of CPUs + 1", on the grounds that that ensures there will always be a thread available whenever a CPU is idle).
With actors I wouldn't worry about having too many. They're very lightweight.
If you really have no experience with Akka, try to start with something simple, like doing a one-to-one actor-for-thread rewrite of your code. That will make it easier to grasp how things work in Akka.
Spin up two actors at the beginning: one for receiving requests and one for writing to the output file. Then, when a request is received, create an actor inside the request-receiver actor that will do the computation and send the result to the writing actor.
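A minimal sketch of that layout with classic Akka actors (all names here are illustrative, not from the original post; the Worker stands in for the actual tasks 1-3):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class Request(item: String)
case class Result(line: String)

// Writes results to the output file (stubbed with println here).
class Writer extends Actor {
  def receive = {
    case Result(line) => println(s"writing: $line")
  }
}

// Receives requests and spawns a short-lived worker per request,
// as the answer above suggests.
class Receiver(writer: ActorRef) extends Actor {
  def receive = {
    case req: Request =>
      context.actorOf(Props(new Worker(writer))) ! req
  }
}

// Stand-in for the actual computation (HTTP request, map, class instance).
class Worker(writer: ActorRef) extends Actor {
  def receive = {
    case Request(item) =>
      writer ! Result(s"processed $item")
      context.stop(self)
  }
}

object Pipeline extends App {
  val system = ActorSystem("pipeline")
  val writer = system.actorOf(Props[Writer], "writer")
  val receiver = system.actorOf(Props(new Receiver(writer)), "receiver")
  receiver ! Request("item-1")
}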

How to resequence after filtering for aggregation /Spring Integration/

I'm doing a project in Spring Integration and I have a big problem.
There are some filtering components in the flow and later in the flow I have an aggregation element.
The problem is that the filtering component does not support the "apply-sequence" property. It filters out some records without modifying the original sequence numbers, yet the number of messages is reduced.
Later in the flow I need an aggregation, which fails to release elements since some messages were filtered out.
I don't want to use any special routing elements which have apply-sequence property.
Can you suggest me any common solution for this type of filtering problem?
Thanks,
I'd say you misunderstand the behaviour of the filter and aggregator.
I guess you have some apply-sequence-aware component upstream. So all messages in that group get several headers: correlationId, to group messages in the default aggregator; sequenceNumber, the index of the message; sequenceSize, the number of messages in the group.
The filter just checks messages against some condition and sends them to the output-channel or applies the discard logic. It doesn't modify messages. And even if we could do that, it wouldn't sound like a good idea anyway.
Assume we have only two messages in the group. The first one passes the filter - we just send it to the aggregator. But the second is discarded, and, yes, it won't be sent to the aggregator. So the group is never released, because the sequenceSize is never reached.
To meet your requirement you need a custom ReleaseStrategy on the aggregator (by default it is SequenceSizeReleaseStrategy). For example, check some state in your system indicating that all messages in the group have been sent, independently of the true or false result of the filter. Or send some fake marker message for the same purpose and check its availability in the group.
In this case you will just need to take care of the correlationId to group messages in the aggregator.
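For illustration, a custom ReleaseStrategy is a single method; a sketch in Scala (the "group-complete" marker header is made up for this example, and you could equally implement this in Java):

import org.springframework.integration.aggregator.ReleaseStrategy
import org.springframework.integration.store.MessageGroup
import scala.jdk.CollectionConverters._

// Sketch: release the group once a terminal "marker" message has arrived,
// regardless of how many messages the upstream filter discarded.
class MarkerReleaseStrategy extends ReleaseStrategy {
  override def canRelease(group: MessageGroup): Boolean =
    group.getMessages.asScala.exists(_.getHeaders.containsKey("group-complete"))
}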
UPDATE
What is the suggested release strategy for such a scenario? Would it be a good strategy to use a timeout as the release strategy?
What I can say is that sometimes it is really difficult to find a good solution for some integration scenarios. Messaging is stateless by nature, so correlating and grouping an undetermined number of messages can be a problem.
You need to look at the requirements and the environment.
For example, when all your messages are processed in a single thread, you can safely send some fake marker message at the end directly to the aggregator and check for it from the ReleaseStrategy. That will work even when all other messages from the group are discarded.
If you process those messages in parallel or they are received from different threads, you really won't be able to determine the order of messages and the time for each process.
In this case the TimeoutCountSequenceSizeReleaseStrategy really can help. Of course, you will need to find a good timeframe compromise according to the requirements of your system.

How to handle bursts with Actors?

Suppose I have an actor which handles X requests per second. That is fine on average, but sometimes there are bursts and clients send Y > X requests per second. Suppose also that all requests have timeouts, so they cannot wait in the queue forever.
Assuming we program in Scala and Akka what are the best practices/design patterns to make the actor handle those bursts? Are there any code examples, which handle bursts?
As long as your machine can handle the increased load (i.e. has enough CPUs), then I would suggest pooling the Actor using a Router. It sounds like from your example, a dynamically resizing router might be the best fit, but even a standard Round Robin or Smallest Mailbox might be enough. Below is the link for the routers section from the Akka documentation. I hope this helps.
http://doc.akka.io/docs/akka/2.1.2/scala/routing.html
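A minimal sketch of pooling behind a router, using the newer RoundRobinPool API (in the Akka 2.1.x docs linked above the equivalent was RoundRobinRouter; the handler actor here is just a stand-in for the actor from the question):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Stand-in for the actor that handles X requests per second.
class RequestHandler extends Actor {
  def receive = {
    case msg => () // handle the request
  }
}

object BurstPooling extends App {
  val system = ActorSystem("burst")
  // 8 handler instances behind a round-robin router; a Resizer can be added
  // to grow and shrink the pool dynamically under bursts.
  val handlers = system.actorOf(RoundRobinPool(8).props(Props[RequestHandler]), "handlers")
  handlers ! "some request"
}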
You could also consider distributing the actor across multiple nodes, but that might be overkill for your scenario. If you have interest in that approach, let me know and I can post more context on doing that.
Now as far as what to do when, after you pool the actors, the system still gets backlogged, that's really up to you, but here are a few options. If you can handle the occasional increases in latency due to bursting, then do nothing. The actors' mailboxes will just get a little backed up, but they will clear as soon as the burst eases off. If not, then the question is how to handle incoming messages when the actors are backlogged. If you want to fail fast in that situation and not accept the message, you might want to look into using a bounded mailbox (http://doc.akka.io/docs/akka/2.1.2/scala/dispatchers.html). When the mailbox reaches its size limit and can no longer queue messages, the caller will get a failure sending the message (I think). Not awesome, but at least it will lead to the system stabilizing faster.
I assume you are doing ask (?) (i.e. request/response), so when you do that, you get a Future. That Future will time out (with an implicitly defined timeout value) if it does not receive a response in time, so during a burst, Futures attached to the calls into the backlogged actors will just start timing out; they will not be stuck there forever.

Scala, Actors, what happens to unread inbox messages?

What happens to unread inbox messages in Scala Actors? For example, two cases:
1. If I forget to implement a react case for a particular message: actor ! NoReactCaseMessage
2. If messages come too fast: (timeOfProcessingMessage > timeOfMessageComes)
If the first or second case happens, would those messages pile up in memory?
EDIT 1 Is there any mechanism to see that this type of memory leak is happening? Maybe control the number of unread messages and then do some garbage collection, or increase the actor pool. How do I get the number of unread messages? How is this sort of memory leak solved in other languages? For example in Erlang?
The mailbox is a queue - if nothing is pulling the messages from the queue (i.e. if the partial function in your react or receive loop returns false for isDefinedAt), then the messages just stay there.
Strictly speaking this is a memory leak (of your application), although the seriousness of it is dependent on how these unread messages grow in number (obviously). For example, I often use actors to merge in a replay query and a "live stream" of messages both identified by a sequence number. My reaction looks like this:
var lastSeq = 0L
loop {
  react {
    // skip messages already seen during the replay; process only newer ones
    case Msg(seq, data) if seq > lastSeq => lastSeq = seq; process(data)
  }
}
This contains a memory leak, but not a "serious" one, as there will be an upper bound in the number of duplicate messages (i.e. there can be no more once the replay query is finished).
However, this may still be an annoyance, as for each reaction, the actor sub-system will scan those messages again to see whether they can be processed.
In fact, thinking about a real mailbox might be a good analogy here. Imagine you left all your junk mail in there: pretty soon, you'd be suffering starvation because of all the junk mail you would have to sift through in order to get to the credit card statement.
How is this sort of memory leak solved in other languages? For example in Erlang?
Same as in Scala. First issue:
If I forget to implement a react case for a particular message
You very rarely need to leave a message in the mailbox intentionally and receive it later. I haven't ever encountered such a situation. So you can always include a catch-all clause (case _ => ... in Scala), which will do nothing, log a warning, or throw an exception -- whatever makes most sense in your situation.
If messages come too fast
Instead of sending messages directly to a process which can't handle them fast enough, add a buffering process. It can throw away extra messages, send them to more than one worker process, etc.
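A sketch of such a buffering process with scala.actors (the Done acknowledgement and the names are made up for illustration): it forwards at most a fixed number of in-flight messages to the slow worker and drops the rest, instead of letting its own mailbox grow without bound.

import scala.actors.Actor
import scala.actors.Actor._

case object Done

// Sketch: a buffer in front of a slow worker. It forwards at most
// `maxInFlight` messages; the worker is assumed to reply with Done
// after it finishes each message.
def buffered(worker: Actor, maxInFlight: Int): Actor = actor {
  var inFlight = 0
  loop {
    react {
      case Done =>
        inFlight -= 1
      case msg if inFlight < maxInFlight =>
        inFlight += 1
        worker ! msg
      case _ =>
        () // overloaded: drop the message rather than let it pile up
    }
  }
}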