How do message/work queues work and what's the justification for a dedicated/hosted queue service? - queue

I'm getting into utilising work queues for my app and can easily understand their benefit: computationally or time-intensive tasks (like sending an email) can be put in a queue and run asynchronously to the request that initiated the task so that it can get on with responding to the user.
For example, in my Laravel app I am using Redis queueing. I push a job onto the queue, and a separate artisan process running on the server (artisan queue:listen) listens on the queue for new jobs to execute.
My confusion is about the actual queuing system. Surely the queue is just a list of jobs with any pertinent data serialised and passed through. The job itself is still computed by the application.
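To make the "list of jobs with serialised data" idea concrete, here is a minimal sketch in Python rather than Laravel. Everything in it is an assumption for illustration: the in-memory deque stands in for a Redis list or beanstalkd tube, and send_email, push and work_once are hypothetical names, not part of any framework.

```python
import json
from collections import deque

# Hypothetical job handler; in a real app this would do the slow work.
def send_email(to, subject):
    return f"sent '{subject}' to {to}"

# Maps a job name stored in the payload to the code that runs it.
HANDLERS = {"send_email": send_email}

# Stand-in for the queue backend (assumption: a Redis list or similar).
queue = deque()

def push(job_name, **data):
    """Producer side: serialise the job name and its data, append to the queue."""
    queue.append(json.dumps({"job": job_name, "data": data}))

def work_once():
    """Worker side: pop one payload, deserialise it and run the handler."""
    if not queue:
        return None
    payload = json.loads(queue.popleft())
    return HANDLERS[payload["job"]](**payload["data"])

# The request handler only pushes; the worker does the actual work later.
push("send_email", to="user@example.com", subject="Welcome")
result = work_once()
```

The queue only stores and hands out payloads; the job itself still runs in application code, which matches the description above.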
So this leads me to wonder about the benefit and cost of large-scale queue services like Iron.io or Amazon SQS. These services cost a fair amount for what seems like a fairly straightforward and computationally minimal task. Even with thousands of jobs a minute, a simple Redis or beanstalkd queue on the same server will surely be handled far more easily than the actual jobs themselves. Having a hosted system seems like it will slow the process down because of the latency between servers.
Can anyone explain the workings of a work queue and how these hosted services improve the performance of such an application? I imagine the benefit is only felt once an app has scaled sufficiently, but more insight would be helpful.

Related

Suitable architecture for queuing and scheduling large jobs

Currently I have many jobs written in Java that pull data from various sources. These jobs run every 5 minutes from crontab on a virtual machine and send the data to a Kafka queue, where there is a consumer that pushes the data to the database.
This system is not very reliable, as you usually need to open the virtual machine to stop jobs or force them to start.
So I need job management software for queuing and scheduling heavy jobs that take several minutes to complete. Preferably I want a user interface so that I can monitor the jobs easily and force start or force stop them at any time. The architecture needs to be robust in two respects: first, scheduling many jobs that take a long time; second, handling job requests in a queue, as sometimes we need to pull data on request from other sources, which runs one time only.

Distributed systems with large number of different types of jobs

I want to create a distributed system that can support around 10,000 different types of jobs. One single machine can host only 500 such jobs, as each job needs some data to be pre-loaded into memory, which can't be kept in a cache. Each job must have redundancy for availability.
I have explored open-source tools like ZooKeeper and Hadoop, but none of them solves my problem.
The easiest solution I can think of is to maintain a map from job type to the machine that hosts it. But how can I support dynamic allocation of job types across my fleet? And how do I handle machine failures, so that each job type is available on at least 1 machine at any point in time?
Based on the answers you gave in the comments, I suggest you go for an MQ-based (message queue) architecture. What I propose in this answer is:
Get the input from users and push it into a distributed message queue. This means setting up a message queue (such as ActiveMQ or RabbitMQ) on several servers. The MQ technology helps you replicate the input requests for fault tolerance. It also gives you a fully end-to-end asynchronous system.
After preparing this MQ layer, you can set up your computing server layer. This means that some computing servers (~20 servers in your case) will read the requests from the message queue and start a job based on each request. Because the MQ is distributed, you can get a good level of load balancing across your computing servers. In addition, each server can run as many jobs as you want (~500 in your case) based on the requests that it reads from the MQ.
Regarding failures, a computing server should remove a request from the MQ if and only if the job is completed. If a server crashes, the job is still in the MQ and another server can work on it. If the job saves some state somewhere or updates something, you then have to handle duplicate runs.
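The "remove only after completion" rule and the duplicate-run caveat can be sketched as follows. This is a toy in-memory model under stated assumptions, not the API of ActiveMQ or RabbitMQ: AckQueue, requeue_unacked and handle are hypothetical names, and a real broker would redeliver on its own when a consumer disappears.

```python
from collections import deque

class AckQueue:
    """Toy queue: a message is only deleted once it is acknowledged."""

    def __init__(self):
        self._ready = deque()   # messages waiting for a worker
        self._in_flight = {}    # delivered but not yet acknowledged
        self._next_id = 0

    def publish(self, body):
        self._ready.append((self._next_id, body))
        self._next_id += 1

    def receive(self):
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        self._in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        # Job finished: now the message can really be dropped.
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self):
        # Assumption: called when a worker is detected as dead.
        for msg_id, body in sorted(self._in_flight.items()):
            self._ready.append((msg_id, body))
        self._in_flight.clear()

processed = set()  # ids of completed jobs, used to skip duplicate runs
results = []

def handle(msg_id, body, crash=False):
    if msg_id in processed:
        return  # duplicate delivery, work was already done
    if crash:
        raise RuntimeError("worker died mid-job")
    results.append(body.upper())
    processed.add(msg_id)

q = AckQueue()
q.publish("job-a")

# First worker crashes before acknowledging, so the job stays in the queue.
msg_id, body = q.receive()
try:
    handle(msg_id, body, crash=True)
    q.ack(msg_id)
except RuntimeError:
    q.requeue_unacked()

# A second worker picks the same job up and completes it.
msg_id, body = q.receive()
handle(msg_id, body)
q.ack(msg_id)
```

The processed set is one simple way to make a job safe to run twice; where that record has to live in a real deployment depends on the system.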
The good point about this approach is that it is very scalable. If in future you have more jobs to handle, you can add a computing server, connect it to the MQ and process more requests without any change to the system. In addition, some nice MQ features, like priority-based queuing, help you prioritize the requests and process them based on the job type.
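Priority-based queuing can be illustrated with a small sketch. This uses Python's standard heapq module as a stand-in and is not how any particular MQ implements priorities; the job type names and priority numbers are made up.

```python
import heapq
import itertools

# Tie-breaker so that jobs with equal priority keep their arrival order.
_counter = itertools.count()
heap = []

def enqueue(priority, job_type):
    """Lower number means higher priority (an arbitrary convention here)."""
    heapq.heappush(heap, (priority, next(_counter), job_type))

def dequeue():
    """Return the job type with the best priority, oldest first on ties."""
    _priority, _seq, job_type = heapq.heappop(heap)
    return job_type

enqueue(5, "nightly-report")
enqueue(1, "user-request")
enqueue(5, "cleanup")

order = [dequeue() for _ in range(3)]
```

Here the on-request job jumps ahead of the two scheduled ones, while the scheduled ones keep their relative order.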
p.s. Your Q does not provide any details about the type and parameters of the system. This is a draft solution that I can propose. If you provide more details, maybe the community can help you more.

Parallelism behaviour of stream processing engines

I have been learning Storm and Samza in order to understand how stream processing engines work, and realized that both of them are standalone applications: in order to process an event I need to add it to a queue that is also connected to the stream processing engine. That means I need to add the event to a queue (which is also a standalone application, let's say Kafka), and Storm will pick the event from the queue and process it in a worker process. And if I have multiple bolts, each bolt will be processed by different worker processes. (This is one of the things I don't really understand; I have seen a company that uses more than 20 bolts in production, with each event transferred between bolts along a certain path.)
However, I don't really understand why I would need such complex systems. The process involves too many IO operations (my program -> queue -> storm -> bolts), and it makes them much harder to control and debug.
Instead, if I'm collecting the data from web servers, why not just use the same node for event processing? The operations will already be distributed over the nodes by the load balancers I use for the web servers. I can create executors on the same JVM instances and send the events from the web server to the executor asynchronously without involving any extra IO requests. I can also watch the executors in the web servers and make sure that the executor processed the events (at-least-once or exactly-once processing guarantee). In this way it will be a lot easier to manage my application, and since not much IO is required, it will be faster than the other way, which involves sending the data to another node over the network (which is also not reliable) and processing it on that node.
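The in-process alternative described above might look roughly like this. It is a sketch in Python rather than on the JVM, using the standard concurrent.futures module; process_event and handle_request are hypothetical names, and the event shape is invented.

```python
from concurrent.futures import ThreadPoolExecutor

def process_event(event):
    # Placeholder for the real event processing.
    return {"id": event["id"], "length": len(event["payload"])}

# Executor living inside the same process as the web server.
executor = ThreadPoolExecutor(max_workers=4)

def handle_request(event):
    """Hand the event to the local executor and return without waiting."""
    return executor.submit(process_event, event)

# Simulate three incoming requests, each carrying one event.
futures = [handle_request({"id": i, "payload": "x" * i}) for i in range(3)]

# "Watching the executor": the futures tell us whether each event was processed.
processed = [f.result() for f in futures]
executor.shutdown()
```

Note that in this arrangement the pending events exist only in the memory of that one process, and the processing shares CPU and memory with request handling.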
Most probably I'm missing something here, because I know that many companies actively use Storm and many people I know recommend Storm or other stream processing engines for real-time event processing, but I just don't understand it.
My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether cpu-bound, I/O-bound or both) from the application/web servers and keep them responsive.
Consider that each application server may have to serve a large number of concurrent requests, not all of them having to do with stream processing. If the app server is already processing a significant load of events, it could become a bottleneck for lighter requests, as the server resources (think CPU usage, memory, disk contention etc.) will already be tied up by heavier processing requests.
If the actual load you need to face isn't that heavy, or if it can simply be handled by adding app server instances, then of course it doesn't make sense to complicate your architecture/topology, which could in fact slow the entire thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help you decide which way to go.
You are right that sending data across the network will add to the total processing time.
However, these frameworks (Storm, Spark, Samza, Flink) were created to process a lot of data that potentially does not fit in memory of one computer. So, if we use more than one computer to process the data we can achieve parallelism.
As for your question about network latency: yes, this is a trade-off to consider. The developer has to be aware that they are writing programs to be deployed on a parallel framework. The way they build the application will also influence how much data is transferred over the network.

Akka - Pulling Pattern vs Durable Mailboxes

I've been working on a project of mine using Akka to create a real-time processing system which takes in the Twitter stream (for now) and uses actors to process said messages in various ways. I've been reading about similar architectures that others have built using Akka and this particular blog post caught my eye:
http://blog.goconspire.com/post/64901258135/akka-at-conspire-part-5-the-importance-of-pulling
Here they explain different issues that arise when pushing work (i.e. messages) to actors vs. having the actors pull work. To paraphrase the article, by pushing messages there is no built-in way to know which units of work were received by which worker, nor can that be reliably tracked. In addition, if a worker suddenly receives a large number of messages where each message is quite large, it might end up overwhelmed and the machine could run out of memory. Or, if the processing is CPU intensive, you could render your node unresponsive due to CPU thrashing. Furthermore, if the JVM crashes, you will lose all the messages that the actor(s) had in their mailboxes.
Pulling messages largely eliminates these problems. Since a specific actor must pull work from a coordinator, the coordinator always knows which unit of work each worker has; if a worker dies, the coordinator knows which unit of work to re-process. Messages also don’t sit in the workers’ mailboxes (since it's pulling a single message and processing it before pulling another one) so the loss of those mailboxes if the actor crashes isn't an issue. Furthermore, since each worker will only request more work once it completes its current task, there are no concerns about a worker receiving or starting more work than it can handle concurrently. Obviously there are also issues with this solution like what happens when the coordinator itself crashes but for now let's assume this is a non-issue. More about this pulling pattern can also be found at the "Let It Crash" website which the blog references:
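The bookkeeping that makes pulling work can be sketched without any actor library. The following is plain Python, not Akka code, and the Coordinator class with its method names is hypothetical; it only shows how the coordinator tracks which worker holds which unit of work and re-queues it when that worker dies.

```python
from collections import deque

class Coordinator:
    """Hands out one work item per worker and remembers who holds what."""

    def __init__(self, work_items):
        self._pending = deque(work_items)
        self._assigned = {}  # worker id -> item it is currently processing

    def request_work(self, worker_id):
        """Called by a worker when it is idle and wants its next item."""
        if worker_id in self._assigned or not self._pending:
            return None
        item = self._pending.popleft()
        self._assigned[worker_id] = item
        return item

    def work_done(self, worker_id):
        """Called by a worker after it finished its current item."""
        self._assigned.pop(worker_id, None)

    def worker_died(self, worker_id):
        """Put the dead worker's item back so another worker can retry it."""
        item = self._assigned.pop(worker_id, None)
        if item is not None:
            self._pending.appendleft(item)

    def pending(self):
        return list(self._pending)

coordinator = Coordinator(["tweet-1", "tweet-2", "tweet-3"])

first = coordinator.request_work("worker-a")
second = coordinator.request_work("worker-b")

# worker-a crashes while holding its item; the coordinator re-queues it.
coordinator.worker_died("worker-a")

# worker-b finishes and pulls again, receiving the re-queued item.
coordinator.work_done("worker-b")
retry = coordinator.request_work("worker-b")
```

Because a worker can hold at most one item at a time, nothing piles up in a worker-side mailbox; the open question mentioned above, what happens when the coordinator itself crashes, is not addressed by this sketch.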
http://letitcrash.com/post/29044669086/balancing-workload-across-nodes-with-akka-2
This got me thinking about a possible alternative to this pulling pattern, which is to do pushing but with durable mailboxes. An example I was thinking of was implementing a mailbox that used RabbitMQ (other data stores like Redis, MongoDB, Kafka, etc. would also work here) and then having the routees behind each router (all of which would be used for the same purpose) share the same message queue (or the same DB/collection/etc., depending on the data store used). In other words, each router would have its own queue in RabbitMQ serving as a mailbox. This way, if one of the routees goes down, those that are still up can simply keep retrieving from RabbitMQ without too much worry that the queue will overflow, since they are no longer using typical in-memory mailboxes. Also, since their mailbox isn't implemented in memory, if a routee crashes, the most it could lose would be the single message it was processing before the crash. If the whole router goes down, then you could expect RabbitMQ (or whatever data store is being used) to handle an increased load until the router is able to recover and start processing messages again.
In terms of durable mailboxes, it seems that back in version 2.0, Akka was gravitating towards supporting these more actively since they had implemented a few that could work with MongoDB, ZooKeeper, etc. However, it seems that for whatever reason they abandoned the idea at some point since the latest version (2.3.2 as of the writing of this post) makes no mention of them. You're still able to implement your own mailbox by implementing the MessageQueue interface which gives you methods like enqueue(), dequeue(), etc... so making one that works with RabbitMQ, MongoDB, Redis, etc wouldn't seem to be a problem.
Anyways, just wanted to get your guys' and gals' thoughts on this. Does this seem like a viable alternative to doing pulling?
This question also spawned a rather long and informative thread on akka-user. In summary it is best to explicitly manage the work items to be processed by a (persistent) actor from which a variable number of worker actors pull new jobs, since that allows better resource management and explicit control over what gets processed and how retries are handled.

Where to host a queue in a rackspace cloud deployment

I am working on an application that consists of an ASP.NET MVC front-end which calls a bunch of web services, and those call into SQL Server. The actions on the front-end could lead to a very large number of jobs that need executing, which I want to queue somewhere.
Due to the expected load profile of the application it makes sense to use a scalable infrastructure like the Rackspace cloud. Now I am wondering where it would be best to queue the jobs. Queueing them on the front-end servers means that the number of front-end servers can only be scaled back down once the queues are processed. That is a waste of resources: once the peak load on the front-end is over, we want to scale it down and scale up the machines that process the queue items.
If we queue them on the database server, we are adding load to a single machine which in the current setup is already the most likely bottleneck. How would you design this?
You should read up on event sourcing and CQRS - in particular Greg Young's 6hr presentation (http://www.viddler.com/v/dc528842) - its aim is to alleviate the burden of these sorts of issues in a tried and tested way.
hth