Suitable architecture for queuing and scheduling large jobs - scheduled-tasks

Currently I have many jobs written in Java that pull data from various sources. These jobs run every 5 minutes via crontab on a virtual machine and send the data to a Kafka queue, where a consumer pushes the data to the database.
This system is not very reliable, as you usually need to log into the virtual machine to stop jobs or force them to start.
So I need job-management software for queuing and scheduling heavy jobs that take several minutes to complete. Preferably I want a user interface so I can monitor the jobs easily and force-start or force-stop them at any time. The architecture needs to be robust in two respects: first, scheduling many jobs that take a lot of time; second, handling job requests in a queue, as sometimes we need to pull data on request from other sources, and such a job will run only once.

Related

Scheduling jobs in Kafka

I am currently working on an application that schedules tasks as timers. A timer can be configured by the user to run on any day of the week. Currently this is implemented with BullQueue and Redis for storage. When a timer fires, it emits an event and the business logic is processed further. There can be thousands of queued messages in Redis.
I am looking to replace Redis with Kafka, as I have read that it is easy to scale and guarantees no message loss.
The question is: is it a good idea to go with Kafka? If yes, how can we schedule jobs in Kafka in combination with BullQueue? I am new to Kafka and still trying to understand how we can schedule jobs in it, and whether it is a good architectural choice at all.
My current application is built with NestJS on Node.js.
Kafka doesn't have any scheduling feature built in, so you'd need to combine it with some other timer/queue system that triggers a KafkaProducer action at the scheduled time.
Similarly, Kafka consumers are typically always running, although you can start and pause them periodically as well.
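The "external timer drives a producer" idea above can be sketched with Python's standard-library sched module. The produce function here is a hypothetical stand-in for a real client call (e.g. kafka-python's producer.send(topic, value)); it just records messages so the sketch is self-contained:

```python
import sched
import time

# Stand-in for a real KafkaProducer; in practice you would call
# producer.send(topic, value) from a client library such as kafka-python.
sent = []

def produce(topic, value):
    """Hypothetical produce call: records the message for illustration."""
    sent.append((topic, value))

scheduler = sched.scheduler(time.time, time.sleep)

def schedule_message(delay_seconds, topic, value):
    """Fire a Kafka produce after the given delay."""
    scheduler.enter(delay_seconds, 1, produce, argument=(topic, value))

# Schedule two messages, then run the timer loop until both have fired.
schedule_message(0.1, "jobs", "run-report")
schedule_message(0.2, "jobs", "sync-data")
scheduler.run()

print(sent)
```

The point is that Kafka only ever sees the message at the moment the timer fires; the delay logic lives entirely outside Kafka, which is why a companion system like BullQueue (or any scheduler) is still needed.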

Distributed systems with large number of different types of jobs

I want to create a distributed system that can support around 10,000 different types of jobs. One single machine can host only 500 such jobs, as each job needs some data to be pre-loaded into memory, which can't be kept in a cache. Each job must have redundancy for availability.
I have explored open-source libraries like ZooKeeper and Hadoop, but none solves my problem.
The easiest solution I can think of is to maintain a map from job type to its hosting machine. But how can I support dynamic allocation of job types across my fleet? How do I handle machine failures so that each job type is available on at least one machine at any point in time?
Based on the answers you gave in the comments, I propose an MQ-based (message queue) architecture:
Get the input from users and push it into a distributed message queue. That means setting up a message queue (such as ActiveMQ or RabbitMQ) on several servers. The MQ replicates the incoming requests for fault tolerance and gives you a fully asynchronous end-to-end system.
After preparing this MQ layer, you can set up your computing-server layer. Some computing servers (~20 in your case) read requests from the message queue and start a job for each request. Because the MQ is distributed, you get a good level of load balancing across the computing servers. In addition, each server can run as many jobs as you want (~500 in your case), based on the requests it reads from the MQ.
Regarding failures, a computing server should only remove a job from the MQ once the job is complete. If a server crashes, the job is still in the MQ and another server can work on it. If the job saves state somewhere or updates something, you will need to handle duplicate runs yourself.
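The "remove only after completion" rule can be illustrated with a small stdlib sketch (queue.Queue stands in for the distributed MQ, and the simulated crash is an assumption for the example): a failed job is requeued instead of acknowledged, so it is not lost.

```python
import queue

jobs = queue.Queue()
for job_id in (1, 2, 3):
    jobs.put(job_id)

completed = []
retried = set()

def process(job_id):
    """Pretend to run the job; job 2 'crashes' on its first attempt."""
    if job_id == 2 and job_id not in retried:
        retried.add(job_id)
        raise RuntimeError("worker crashed")
    completed.append(job_id)

# Worker loop: a job leaves the queue for good only after it completes.
# On failure it is put back, so another worker (or a retry) picks it up.
while not jobs.empty():
    job_id = jobs.get()
    try:
        process(job_id)
    except RuntimeError:
        jobs.put(job_id)  # requeue instead of acknowledging

print(completed)
```

With a real broker this corresponds to consuming with manual acknowledgements: the ack is sent only after the handler returns, so an unacked message is redelivered.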
The good point about this approach is that it is very scalable: if you have more jobs to handle in the future, you can add a computing server and connect it to the MQ to process more requests without any change to the rest of the system. In addition, MQ features like priority-based queuing help you prioritize requests and process them based on job type.
P.S. Your question does not provide details about the type and parameters of the system; this is a draft solution. If you provide more details, maybe the community can help you more.

Does Celery task code need to exist on multiple machines?

Trying to wrap my head around Celery. So far I understand that there is the Client, Broker, and Worker(s). For simplicity, I'm looking at a Dockerized version of this setup.
If I understand correctly, the client enqueues a task on the broker, then the worker continuously attempts to pop from the broker and process the task.
In the example, it seems like both the Client (in this case a Flask app) and the Worker reference the exact same code. Is that correct?
Does this mean that if each of the components were broken out onto their own machines, the code would need to be deployed to both the Client and Worker machines at the same time? It seems strange that these pieces would need access to the same task/function to do their respective work.
This is one thing I was initially confused by as well. The task's code doesn't have to be present on both; only the worker needs the code to do the actual work. When you say the client enqueues the task on the broker, which the worker then executes, it's crucial to understand how this works: the client only sends a message to the broker, not an actual task. The message contains the task name, its arguments and other metadata. So the client needs to know just these parameters to enqueue the task. It can then use send_task to enqueue the task by name, without knowing its code.
This is how I employ Celery in a simple jobber application where I want to decouple the pieces as much as possible. I have a Flask app that serves as a UI for the jobs, which users can manage (create, see the state/progress, etc.). The Flask app uses APScheduler to actually run the jobs (where a job is nothing else than a Celery task). Now, I want the client part (Flask app + scheduler) to know as little as possible about the tasks in order to run them as jobs: just their names and the arguments they take. To make it really independent of the tasks' code, I get this information from the workers via the broker. You can see a bit more background in an issue I initially created.

How do message/work queues work, and what's the justification for a dedicated/hosted queue service?

I'm getting into utilising work queues for my app and can easily understand their benefit: computationally or time-intensive tasks (like sending an email) can be put in a queue and run asynchronously to the request that initiated the task so that it can get on with responding to the user.
For example, in my Laravel app I am using Redis queueing. I push a job onto the queue, and a separate artisan process running on the server (artisan queue:listen) listens on the queue for new jobs to execute.
My confusion is when it comes to the actual queuing system. Surely the queue is just a list of jobs with any pertinent data serialised and passed along; the job itself is still computed by the application.
So this leads me to wonder about the benefit and cost of large-scale queue services like Iron.io or Amazon SQS. These services cost a fair amount for what seems like a fairly straightforward and computationally minimal task. Even with thousands of jobs a minute, a simple Redis or beanstalkd queue on the same server will surely handle the queuing far more easily than the jobs themselves. A hosted system seems like it would slow the process down due to the latency between servers.
Can anyone explain the workings of a work queue and how these hosted services improve the performance of such an application? I imagine the benefit is only felt once an app has scaled sufficiently, but more insight would be helpful.
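The core mechanic the question describes, a request path that enqueues and returns immediately while a separate worker process drains the queue, can be sketched in-process with Python threads (the email-sending job is a stand-in example; a real setup would put the worker in its own process or machine, which is where hosted queues come in):

```python
import queue
import threading

tasks = queue.Queue()
log = []

def worker():
    """Background worker: drains the queue one job at a time."""
    while True:
        item = tasks.get()
        if item is None:      # sentinel value: shut down
            break
        log.append(f"sent email to {item}")  # the slow job runs here
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

# The "web request" just enqueues and returns; it never waits on sending.
for user in ("alice", "bob"):
    tasks.put(user)

tasks.join()      # wait for the worker to finish outstanding jobs
tasks.put(None)   # stop the worker
t.join()
print(log)
```

A hosted service replaces queue.Queue with durable, replicated storage reachable from many machines, which is what you pay for: the queuing itself is cheap, but surviving crashes and fanning out to a fleet is not.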

Programming language for developing multi platform daemon

I am developing an application that needs to perform lots of batch processing (for example, whois requests). To ensure the best performance, I would like to split the work between different computers. For this, I am planning to write a program that will query the job queue on the main server, fetch one job, process it and update the main server with the result. The actual processing will be done by PHP; my program only needs to poll the job queue and invoke the local PHP script. The job needs to run every few seconds, hence I cannot use cron.
Can anyone suggest a programming language that can create such a daemon easily? Is there any program available that already does this?
Thanks
A few notes
You can use cron to run every few seconds (though only with hacks).
You will need some sort of distributed queue to hold your jobs (RabbitMQ is a good one, or you can use ZooKeeper).
Depending on the queue you pick, there are APIs in many programming languages for removing jobs from the queue.
There are many open-source tools that do similar things, but the right choice will depend greatly on how sophisticated your needs are.
Hadoop is a complex product, but it will let you implement this.
workerpool is a Python library that is simple to use, but it is multithreaded and runs on a single machine, so it sits at the simple end of the spectrum.
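For the simple end of the spectrum, the polling daemon itself is only a few lines in Python. This sketch makes the fetch and run steps injectable (the in-memory job list and the max_cycles bound are assumptions for illustration; in the real program run_job would shell out to the local PHP script, e.g. subprocess.run(["php", "process_job.php", job])):

```python
import time

def poll_forever(fetch_job, run_job, interval_seconds=5, max_cycles=None):
    """Poll the job queue every few seconds and hand each job to run_job.

    fetch_job returns the next job or None if the queue is empty.
    max_cycles exists only so the loop can be bounded for demonstration;
    a real daemon would loop indefinitely.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        job = fetch_job()
        if job is not None:
            run_job(job)
        time.sleep(interval_seconds)
        cycles += 1

# Example wiring with in-memory stand-ins for the remote queue and PHP call.
pending = ["whois example.com", "whois example.org"]
done = []
poll_forever(
    fetch_job=lambda: pending.pop(0) if pending else None,
    run_job=done.append,
    interval_seconds=0.01,
    max_cycles=3,
)
print(done)
```

Wrapped in a systemd unit or supervisor entry, a loop like this covers the "run every few seconds" requirement that cron handles awkwardly.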