I have seen lots of chat examples in Erlang but what about lists, like a work queue? If I want to build a work queue system, like a project management system, is it possible to re-order messages in a process mailbox or do I have to use message priorities? Are there examples of workflow systems built in Erlang?
You cannot reorder messages in process message queues in Erlang.
You can, however, do selective receives, in which you receive the message you deem most important first. It's not entirely the same, but it works for most purposes.
Here's an example:
receive
    {important, Msg} ->
        handle(Msg)
after 0 ->
    ok
end,
receive
    OtherMsg ->
        handle(OtherMsg)
end
It differs from:
receive
    {important, Msg} ->
        handle(Msg);
    OtherMsg ->
        handle(OtherMsg)
end
The difference is that the first version always scans the whole message queue for {important, Msg} before it goes on to handle the rest of the messages. Messages of that kind are therefore always handled before any others, if they exist. This of course comes at some performance cost, since the queue may be scanned twice.
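To make the scanning behaviour concrete, here is a rough model of it in Python rather than Erlang. It is only a sketch: the mailbox is a plain deque, the function name receive_important_first is made up for illustration, and it returns one message per call instead of handling two as the Erlang snippet does.

```python
from collections import deque


def receive_important_first(mailbox):
    """Return the next message, preferring ("important", ...) tuples.

    `mailbox` is a deque standing in for an Erlang process mailbox.
    """
    # First pass: scan the whole mailbox for an important message,
    # like `receive {important, Msg} -> ... after 0 -> ok end`.
    for index, msg in enumerate(mailbox):
        if isinstance(msg, tuple) and msg and msg[0] == "important":
            del mailbox[index]
            return msg
    # Second pass: nothing important is waiting, so take the oldest message.
    return mailbox.popleft() if mailbox else None
```

The first pass is where the extra cost comes from: every call walks past all the ordinary messages before giving up on finding an important one.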
Process mailboxes work quite well as-is for job queues.
Just have your messages include sufficient information so that selective receive patterns are easy to write, and you won't feel the need to re-order mailbox contents.
If you do need to reorder messages, you can follow the gatekeeper pattern: reify the mailbox as a separate process. When your original process is ready for another message, the gatekeeper can compute which message to forward, by any rule you choose.
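A minimal sketch of the gatekeeper idea, written in Python instead of Erlang so it can be run standalone. The class name Gatekeeper, its methods, and the priority_of callback are all hypothetical; in Erlang the gatekeeper would be its own process and next_message would be a request/reply exchange with it.

```python
import heapq
import itertools


class Gatekeeper:
    """Holds messages on behalf of a worker and decides which one is next."""

    def __init__(self, priority_of):
        # priority_of(msg) returns a sortable value; lower means "sooner".
        self._priority_of = priority_of
        self._heap = []
        # Monotonic counter keeps arrival order among equal priorities
        # and prevents the messages themselves from being compared.
        self._arrival = itertools.count()

    def submit(self, msg):
        """Called by senders instead of messaging the worker directly."""
        entry = (self._priority_of(msg), next(self._arrival), msg)
        heapq.heappush(self._heap, entry)

    def next_message(self):
        """Called by the worker when it is ready for another message."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Any rule can be plugged in through priority_of, for example a deadline or a project rank, without the worker having to know about it.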
I have a Java web-service that I am going to reimplement from scratch in Scala. I have an actor-based design for the new code, with around 10-20 actors. One of the use-cases has a flow like this:
Actor A gets a message a, creates tens of b messages to be handled by Actor B (possibly multiple instances, for load balancing), producing multiple c messages for Actor C, and so on.
In the scenario above, one message a could lead to a few thousand messages being sent back and forth, but I don't expect more than a handful of a messages a day (yes, it is not a busy service at the moment).
I have the following requirements:
Messages should not be lost or repeated. I mean if the system is restarted in the middle of processing b messages, the unprocessed ones should be picked up after restart. On the other hand, the processed ones should not be taken again (these messages will in the end start some big computation, and repeating them is costly).
It should be easily extensible. I mean in the future, I may want to add some other components to the system that can read all the communication (or parts of it) and for example make a log of what has happened, or count how many b messages were processed, or do something new with the b messages (next to what is already happening), etc. Note that these "components" could be independent applications written in other languages.
I am new to message bus technologies, but from what I have read, these requirements sound to me like what "message buses" offer, like RabbitMQ, Kafka, Kestrel, but I also see that akka also offers some means for persistence.
My problem is, given the huge range of possibilities, I am lost which technology to use. I read that something like Kafka is probably an overkill for my application. But I am also not sure if akka persistence answers my two requirements (especially the extensibility).
My question is: Should I go for an enterprise message bus? Something like Kafka? Or something like akka persistence will do?
Or would it be just faster and more appropriate if I implement something myself (with support for, say, AMQP to allow extensibility)?
Of course, specific technology suggestions are also welcome if you know of something that fits this purpose.
A message bus (more commonly called a message broker) like RabbitMQ can handle all of the messaging mechanisms you describe in your question out of the box. Specifically, RabbitMQ is able:
To deliver messages without repeating the message.
To extend the system and add logging and have statistics like you describe.
The scenario is publisher/subscriber, and I am looking for a solution that makes it feasible to send one message generated by ONE producer to MULTIPLE consumers in real time. The more lightweight the solution that handles this scenario, the better!
As for AMQP servers, I've only checked out RabbitMQ. With RabbitMQ, for the pub/sub pattern each consumer has to declare an anonymous, private queue and bind it to a fanout exchange, so with a thousand users consuming one message in real time there will be a thousand or so anonymous queues for RabbitMQ to handle.
But I really do not like RabbitMQ's approach. It would be ideal if RabbitMQ could handle this pub/sub scenario with one queue and one message, with many consumers listening on that one queue!
What I want to ask is: which AMQP server or other type of solution (anything similar, including XMPP servers, Apache Kafka, ...) handles the pub/sub pattern better and much more efficiently than RabbitMQ, while (of course) consuming fewer server resources?
preferences in order of interest:
in the case of an AMQP-enabled server, handling the pub/sub scenario with only ONE queue or a small number of queues (as explained)
handling thousands of consumers in a lightweight manner, consuming fewer server resources than other solutions for the pub/sub pattern
clustering, tolerating failing of nodes
Many Language Bindings ( Python and Java at least)
easy to use and administer
I know my question may be VERY general but I like to hear the ideas and suggestions for the pub/sub case.
thanks.
In general, for RabbitMQ, if you put the user in the routing key, you should be able to use a single exchange and then a small number of queues (even a single one if you wanted, but you could divide them up by server or similar if that makes sense given your setup).
If you don't need guaranteed order (as one would for, say, guaranteeing that FK constraints wouldn't get hit for a sequence of changes to various SQL database tables), then there's no reason you can't have a bunch of consumers drawing from a single queue.
If you want a broadcast-message type of scenario, then that could perhaps be handled a bit differently. Instead of the single user in the routing key, which you could use for non-broadcast-type messages, have a special user type, say, __broadcast__, that no user could actually have, and have the users to broadcast to stored in the payload of the message along with the message itself.
Your message processing code could then take care of depositing that message in the database (or whatever the end destination is) across all of those users.
Edit in response to comment from OP:
So the routing key might look something like this: message.[user], where [user] could be the actual user if it were a point-to-point message, or a special __broadcast__ user (or a similar user name that an actual user would not be allowed to register), which would indicate a broadcast-style message.
You could then place the users to which the message should be delivered in the payload of the message, and then that message content (which would also be in the payload) could be delivered to each user. The mechanism for doing that would depend on what your end destination is. i.e. do the messages end up getting stored in Postgres, or Mongo DB or similar?
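As a rough illustration of how the consuming side might interpret such a routing key, here is a small Python sketch. It does not talk to RabbitMQ at all; the function name recipients_for and the "recipients" payload field are assumptions made up for this example, and only the message.[user] shape and the __broadcast__ name come from the description above.

```python
BROADCAST_USER = "__broadcast__"  # reserved name no real user may register


def recipients_for(routing_key, payload):
    """Work out which users a consumed message should be delivered to.

    routing_key is expected to look like "message.<user>".
    payload is a dict; for broadcasts it carries a "recipients" list
    (hypothetical field name).
    """
    prefix, _, user = routing_key.partition(".")
    if prefix != "message" or not user:
        raise ValueError("unexpected routing key: %r" % routing_key)
    if user == BROADCAST_USER:
        # Broadcast: the target users travel inside the payload.
        return list(payload.get("recipients", []))
    # Point-to-point: the user is named in the routing key itself.
    return [user]
```

The message-processing code would then loop over the returned users and write the message content to whatever the end destination is.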
I am trying to implement a job queue with MSMQ to save myself some time implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise me whether my plan is realistic using MSMQ, or recommend an alternative?
I have number of processes picking up jobs from a queue (I might need to scale out in the future), once job is picked up processing follows, during this time job is locked to other processes by status, if needed job is chucked back (status changes again) to the queue for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it, e.g. I can peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like raven.
Take a look at RabbitMQ, or many of the other message queues. Most offer this functionality out of the box.
For example, RabbitMQ calls what you are describing Work Queues. Multiple consumers can pull from the same queue without pulling the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since the move occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
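The one-queue-per-status idea can be sketched in a few lines of Python. This is not MSMQ code: the StatusQueues class and its method names are invented for illustration, the queues are in-memory deques, and a lock stands in for the MSMQ transaction that makes the receive-and-send pair atomic.

```python
import threading
from collections import deque


class StatusQueues:
    """One in-memory queue per status; a job's queue is its status."""

    def __init__(self, statuses):
        self._queues = {status: deque() for status in statuses}
        self._lock = threading.Lock()  # stand-in for a transaction

    def put(self, status, job):
        with self._lock:
            self._queues[status].append(job)

    def move_next(self, src, dst):
        """Atomically move the oldest job from `src` to `dst`.

        Returns the moved job, or None if `src` is empty.
        """
        with self._lock:
            if not self._queues[src]:
                return None
            job = self._queues[src].popleft()
            self._queues[dst].append(job)
            return job

    def snapshot(self):
        """Copy of every queue, e.g. for monitoring."""
        with self._lock:
            return {s: list(q) for s, q in self._queues.items()}
```

A worker would call move_next("pending", "processing") to claim a job, and later move it on to a "done" queue or back to "pending", which replaces the in-place status update the question asks about.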
MSMQ, and queues in general, require a different set of patterns than what most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.
So, I built this small example of a ZeroMQ pipeline architecture because I'll end up having to do something similar very soon and I'm trying to grasp the pipeline concept the right way.
https://gist.github.com/2765708
Right now, this is completely asynchronous. The controller dispatches a batch of tasks to various workers, which in their turn, send a message to the sink. The controller and sink are fixed parts of my architecture, while workers are dynamic. That's perfect.
However, I would like to know when the workers have finished working on all their tasks. In that example, I do know the number of messages, but that won't be true in real-life situations. I might have 100 messages or 10,000. So, how can the sink or the controller know when the workers have finished working on their tasks? I have to perform some actions that depend on the completion of the jobs sent to the workers.
I wanted to expand on @bjlaub's answer. It started as a comment but I was typing too much. I agree with the concept of acknowledgment, but believe it can originate in multiple places.
There are multiple approaches to this communication and it all depends on the behavior you are after in the system.
First, you can either send out messages from the workers as they finish each task, or from the sink as it receives each result. Right now I am not addressing the type of socket, only the act of communicating. I believe it is much more efficient to send it from the sink, as you would only need one connection back to the controller instead of one for each worker. The sink does not need to know how many total tasks there are, only that it fires off a message after each result it receives. The controller can determine how many to expect, since it was the submission point and knew when it had exhausted its submissions (the count).
Now, regardless of whether you have the message sent from the worker or the sink, you can use different socket types. If you want the controller to block completely until all work is done, then you can have it be a push/pull until it receives X messages (the message content can be anything; it's just a trigger).
This may be limiting if the controller wants to be able to do other work while these tasks are happening. If so, you could maybe use pub/sub, and let the controller subscribe to being notified as tasks complete, and asynchronously maintain a count until the total has been satisfied.
And finally, maybe you have the situation where you want the controller to ask the sink for a status when you deem fit. You can have a req/rep pattern for the controller to ask the sink how many requests it has received on demand.
I'm sure one of these patterns will fit your specific needs.
One idea (disclaimer: I have very little experience w/ 0MQ!):
Set up an "acknowledgment" pipeline in the reverse direction. Since the controller presumably knows how many tasks it has dispatched to the workers (e.g. the number of times it called send), it can use a PULL socket to receive a small message (an integer, for example) from each worker indicating the completion of a task. The worker process dispatches its completed result to the sink, and at the same time sends the acknowledgement back to the controller. Once the controller has collected the right number of acknowledgements, it can do whatever post-processing is necessary before farming out the next set of work.
You could also push this downstream to the sink, but you would need to notify the sink of the total number of work units to expect before farming them out to the workers.
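The counting logic behind the reverse acknowledgment pipeline can be tried out without ZeroMQ. The sketch below is an assumption-laden stand-in: threads play the workers, queue.Queue objects play the PUSH/PULL sockets, run_batch is a made-up name, and squaring a number is placeholder work.

```python
import queue
import threading


def run_batch(tasks, num_workers=3):
    """Dispatch tasks, then block until every one has been acknowledged."""
    task_q = queue.Queue()    # controller -> workers
    result_q = queue.Queue()  # workers -> sink
    ack_q = queue.Queue()     # workers -> controller (reverse pipeline)

    def worker():
        while True:
            task = task_q.get()
            if task is None:           # shutdown sentinel
                break
            result_q.put(task * task)  # placeholder work, sent to the sink
            ack_q.put(1)               # small acknowledgment to the controller

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The controller counts as it sends, so the total need not be known
    # in advance (tasks can be any iterable, including a generator).
    dispatched = 0
    for task in tasks:
        task_q.put(task)
        dispatched += 1

    # Collect exactly one acknowledgment per dispatched task.
    for _ in range(dispatched):
        ack_q.get()

    # All work is done: stop the workers and drain the sink.
    for _ in threads:
        task_q.put(None)
    for t in threads:
        t.join()
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return dispatched, sorted(results)
```

With real sockets the structure stays the same: the controller keeps the dispatched count and reads from its PULL socket until the counts match.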
What happens to unread inbox messages in Scala Actors? For example, two cases:
1. If I forget to implement a react case for a special message: actor!NoReactCaseMessage
2. If messages come too fast: (timeOfProcessingMessage > timeOfMessageComes)
If first or second case happens, would it be stacked in memory?
EDIT 1: Is there any mechanism to see that this type of memory leak is happening? Maybe monitor the number of unread messages, then trigger some garbage collection or increase the actor pool. How do I get the number of unread messages? How is this sort of memory leak solved in other languages? For example, in Erlang?
The mailbox is a queue - if nothing is pulling the messages from the queue (i.e. if the partial function in your react or receive loop returns false for isDefinedAt), then the messages just stay there.
Strictly speaking this is a memory leak (of your application), although the seriousness of it is dependent on how these unread messages grow in number (obviously). For example, I often use actors to merge in a replay query and a "live stream" of messages both identified by a sequence number. My reaction looks like this:
var lastSeq = 0L
loop {
  react {
    case Msg(seq, data) if seq > lastSeq => lastSeq = seq; process(data)
  }
}
This contains a memory leak, but not a "serious" one, as there will be an upper bound in the number of duplicate messages (i.e. there can be no more once the replay query is finished).
However, this may still be an annoyance, as for each reaction, the actor sub-system will scan those messages again to see whether they can be processed.
In fact, thinking about a real mailbox might be a good analogy here. Imagine you left all your junk mail in there: pretty soon, you'd be suffering starvation because of all the junk mail you would have to sift through in order to get to the credit card statement.
How is this sort of memory leak solved in other languages? For example, in Erlang?
Same as in Scala. First issue:
If I forget to implement a react case for a special message
You very rarely need to leave a message in the mailbox intentionally and receive it later. I haven't ever encountered such a situation. So you can always include a catch-all clause (case _ => ... in Scala), which will do nothing, log a warning, or throw an exception -- whatever makes most sense in your situation.
If messages come too fast
Instead of sending messages directly to a process which can't handle them fast enough, add a buffering process. It can throw away extra messages, send them to more than one worker process, etc.
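Here is one possible shape for such a buffer, sketched in Python rather than Scala or Erlang. The DroppingBuffer class and its method names are hypothetical, and it implements just one of the policies mentioned above: throwing away new messages once a fixed capacity is reached.

```python
from collections import deque


class DroppingBuffer:
    """Bounded buffer that sits between senders and a slow consumer."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self.dropped = 0  # how many messages were thrown away

    def offer(self, msg):
        """Accept msg if there is room; otherwise drop it and return False."""
        if len(self._items) >= self._capacity:
            self.dropped += 1
            return False
        self._items.append(msg)
        return True

    def take(self):
        """Called by the consumer when it is ready; None if empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)
```

Because the buffer tracks its own length and drop count, it also answers the monitoring question: the number of unread messages is len(buffer), and memory use is bounded by the capacity.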