How job queueing software works - worker

I have a small website where I allow people to sign up for accounts. After someone has signed up, I want them to be sent a welcome email and a confirmation email. I want to attempt this using a job queuing system.
I have never used a job queuing system before, but this is what I want to do.
I want some jobs to have priority over others. For instance, if someone is from my area code, I want the confirmation email to be sent immediately, while if a user is from abroad, I want to send the confirmation email after a delay of some minutes, say five.
My first question is about how the actual jobs are processed. Can workers be made to process some jobs immediately if a job has priority over others, while other jobs wait five minutes or more before they are started?
Second question: how does a job queuing system know that there are jobs in the queue? Does it poll the queue at a fixed interval, e.g. every 5 minutes, or is there some mechanism that keeps it notified of newly arrived jobs?
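One way to picture the two behaviours asked about here is a queue in which every job carries a priority and an earliest start time, and a worker only receives jobs whose start time has passed. The sketch below is a minimal in-process illustration using only the Python standard library. The class name DelayedPriorityQueue, its methods, and the job names are hypothetical and do not belong to any real queuing system; real job queues offer delays and priorities through their own APIs.

```python
import heapq
import itertools


class DelayedPriorityQueue:
    """Jobs become eligible at their start time; among eligible jobs,
    the lowest priority number is handed out first."""

    def __init__(self):
        self._waiting = []  # heap ordered by earliest start time
        self._ready = []    # heap ordered by priority
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def put(self, job, priority, now, delay=0.0):
        entry = (now + delay, next(self._counter), priority, job)
        heapq.heappush(self._waiting, entry)

    def pop_ready(self, now):
        """Return the highest-priority job whose delay has elapsed, or None."""
        # Move every job whose start time has passed into the ready heap.
        while self._waiting and self._waiting[0][0] <= now:
            _, seq, priority, job = heapq.heappop(self._waiting)
            heapq.heappush(self._ready, (priority, seq, job))
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]


queue = DelayedPriorityQueue()
queue.put("confirm:abroad-user", priority=5, now=0, delay=300)  # wait five minutes
queue.put("welcome:local-user", priority=5, now=0)
queue.put("confirm:local-user", priority=1, now=0)              # urgent, no delay

first = queue.pop_ready(now=1)    # the urgent local confirmation
second = queue.pop_ready(now=1)   # the local welcome email
third = queue.pop_ready(now=1)    # None: the abroad job is not ready yet
later = queue.pop_ready(now=301)  # five minutes have passed
```

Times are passed in explicitly (as seconds) so the behaviour is easy to follow; a real worker would use the current clock and would normally block waiting for the queue rather than call pop_ready in a tight loop.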

Related

non-stop workers in celery

I'm creating a distributed web crawler that crawls multiple social media at the same time. This system is designed to distribute the available resources to the different social media based on their current post rate.
For example, if social media 1 has 10 new posts per hour and social media 2 has 5 posts per hour, 2 crawlers focus on social media 1 and 1 crawler focuses on social media 2 (if we are allowed to have just three crawlers).
I have decided to implement this project via Celery, Flask, rabbitMQ, and Kubernetes as the resource manager.
I have some questions regarding the deployment:
How can I tell celery to keep a fixed number of tasks in rabbitMQ? This crawler should never stop crawling and should create a new task based on the social media's post rates (which is gathered from the previous crawling data), but the problem is, I don't have a task submitter for this process. Usually, there is a task submitter for celery that submits tasks, but there is no such thing as a task submitter in this project. We have a list of social media and the number of workers they need (stored in Postgres) and need celery to put a task in rabbitMQ as soon as a task is finished.
I have tried the solution to submit a task at the end of every job (Crawling Process), but this approach has a problem and is not scalable. In this case, the submitted job would be the last in the rabbitMQ queue.
I need a system to manage the free workers and assign tasks to them immediately. The system I want should check the free and busy workers and database post rates and give a task to the worker. I think using rabbitMQ (or even Redis) might not be good because they are message brokers which assign a worker to a task in the queue but here, I don't want to have a queue; I want to start a task immediately when a free worker is found. The main reason queueing is not good is that the task should be decided when the job is starting, not before that.
My insights on your problem:
"I need a system to manage the free workers and assign tasks to them immediately."
Celery does this job for you.
"The system I want should check the free and busy workers and database post rates and give a task to the worker."
Celery is a task distribution system; it will distribute the tasks as you expect.
"I think using rabbitMQ (or even Redis) might not be good because they are message brokers which assign a worker to a task in the queue"
With Celery you definitely need a broker. The broker just holds your messages; Celery workers consume from the queues, and the tasks are distributed to the right workers (with support for priority, timeouts, soft time limits, and retries).
"but here, I don't want to have a queue; I want to start a task immediately when a free worker is found. The main reason queueing is not good is that the task should be decided when the job is starting, not before that."
This is a kind of chain reaction: triggering a new job once the previous one is done. If that is the case, you don't even need Celery or a distributed producer-consumer system.
Identify the problem:
1. Do you need a periodic task to be executed at a point in time? Go with a cron job or celery beat (Celery's periodic task scheduler).
2. Do you require multiple tasks to be executed without blocking the other running tasks? You need a producer-consumer system (Celery as an out-of-the-box solution, or native Python consumers for RabbitMQ/Redis).
3. If the same task should trigger the new task, there is no need to have multiple workers. What would you achieve with multiple workers if your work is just a single thread?
Outcome: [Celery, RabbitMQ, and Kubernetes: a good combo for a distributed orchestrated system] or [a webhook model] or [a recursive Python script]
Reply to your comment below, @alavi:
One way of doing it is to write a periodic job (running every second, minute, hour, or at whatever rate you need) using celery beat, which acts as a producer or parent task. It can iterate over all media sites from the DB and spawn a new task for crawling each one. The work status can be maintained in the DB, and new tasks can be spawned based on that status. As a start, the parent task can check whether the previous job is still running, or check the progress of the last task, and decide based on that. You could even split the crawl job into micro tasks that are triggered from the parent job. You can refine these rules further during development, or once you have performance data.
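The parent task described above needs some rule for turning post rates into a number of crawl tasks per site. A minimal sketch of one possible rule, proportional allocation using the largest-remainder method, is shown below in plain Python. The function name allocate_crawlers and the site names are hypothetical; in the setup from the question the rates would be read from Postgres and each allocated slot would be dispatched as a Celery task.

```python
def allocate_crawlers(post_rates, total_crawlers):
    """Split `total_crawlers` across sites in proportion to their post rates."""
    total_rate = sum(post_rates.values())
    if total_rate <= 0:
        return {site: 0 for site in post_rates}

    # Ideal (fractional) share of crawlers for each site.
    shares = {
        site: total_crawlers * rate / total_rate
        for site, rate in post_rates.items()
    }
    # Start from the whole part of each share.
    allocation = {site: int(share) for site, share in shares.items()}
    leftover = total_crawlers - sum(allocation.values())
    # Give the remaining crawlers to the sites with the largest remainders.
    by_remainder = sorted(
        shares, key=lambda site: shares[site] - allocation[site], reverse=True
    )
    for site in by_remainder[:leftover]:
        allocation[site] += 1
    return allocation


rates = {"social_media_1": 10, "social_media_2": 5}
plan = allocate_crawlers(rates, total_crawlers=3)
```

With the numbers from the question (10 and 5 posts per hour, three crawlers) this gives two crawlers to the first site and one to the second. Because the allocation is recomputed on every run of the parent task, the decision is made when work is about to start rather than when a message was queued.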

Architecting a configurable user notification service

I am building an application which needs to send notifications to users at a fixed time of day. Users can choose which time of day they would like to be notified, and which days they would like to be notified. For example, a user might like to be notified at 6am every day, or at 7am only on weekdays.
On the back-end, I am unsure how to architect the service that sends these notifications. The solution needs to handle:
concurrency, so I can scale my servers (notifications should not be duplicated)
system restarts
if a user changes their preferences, pending notifications should be rescheduled
Using a message broker such as RabbitMQ and task scheduler such as Celery may meet your requirements.
Asynchronous, or non-blocking, processing is a method of separating the execution of certain tasks from the main flow of a program. This provides you with several advantages, including allowing your user-facing code to run without interruption.
Message passing is a method which program components can use to communicate and exchange information. It can be implemented synchronously or asynchronously and can allow discrete processes to communicate without problems. Message passing is often implemented as an alternative to traditional databases for this type of usage because message queues often implement additional features, provide increased performance, and can reside completely in-memory.
Celery is a task queue that is built on an asynchronous message passing system. It can be used as a bucket where programming tasks can be dumped. The program that passed the task can continue to execute and function responsively, and then later on, it can poll celery to see if the computation is complete and retrieve the data.
While Celery is written in Python, its protocol can be implemented in any language. celery worker is an implementation of Celery in Python. If the language has an AMQP client, there shouldn’t be much work to create a worker in your language. A Celery worker is just a program connecting to the broker to process messages.
Also, there’s another way to be language independent, and that’s to use REST tasks, instead of your tasks being functions, they’re URLs. With this information you can even create simple web servers that enable preloading of code. Simply expose an endpoint that performs an operation, and create a task that just performs an HTTP request to that endpoint.
Here is the Python example from the official documentation:
from celery import Celery
from celery.schedules import crontab

app = Celery()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Calls test('hello') every 10 seconds.
    sender.add_periodic_task(10.0, test.s('hello'), name='add every 10')

    # Calls test('world') every 30 seconds
    sender.add_periodic_task(30.0, test.s('world'), expires=10)

    # Executes every Monday morning at 7:30 a.m.
    sender.add_periodic_task(
        crontab(hour=7, minute=30, day_of_week=1),
        test.s('Happy Mondays!'),
    )

@app.task
def test(arg):
    print(arg)
As I see it, you need three types of entities: users (to store an email address or some other way to reach the user), notifications (to store what you want to send to the user: text etc.) and schedules (to store when the user wants to get notifications). You need to store entities of those types in some kind of database.
A schedule should be connected to a user; a notification should be connected to a user and a schedule.
Assume you have a cron job that starts some script every minute. This script will try to get all notifications connected with a schedule for the current time (the job's start time). Don't forget to implement some type of overlap prevention, so that two runs never pick up the same minute.
After this, the script will place tasks (with all needed data: type of notification, users you want to notify, etc.) in a queue (beanstalkd or something similar). You can create as many workers (even on different physical instances) as you want to serve this queue, without thinking about duplication, because each queued task is handed to only one worker. This gives you a lot of room to scale.
If a user changes their schedule, the change affects all of their notifications at the same moment. There are no pending notifications, as they are only created when they really should be sent.
This is a very high-level description. Many things depend on the language, database(s), queue server, and worker implementation.
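The every-minute script described above boils down to a lookup of which schedules match the current minute and weekday. Below is a minimal sketch of that lookup in plain Python. The Schedule class, the due_schedules function, and the field names are hypothetical, the schedules are held in a list instead of a database, and time zones are ignored; each returned schedule would then be turned into a task placed on the queue.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Schedule:
    user_id: int
    hour: int
    minute: int
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday, as in datetime.weekday()


def due_schedules(schedules, now):
    """Return the schedules that should fire in the minute containing `now`."""
    return [
        s for s in schedules
        if s.hour == now.hour
        and s.minute == now.minute
        and now.weekday() in s.weekdays
    ]


EVERY_DAY = frozenset(range(7))
WEEKDAYS = frozenset(range(5))

schedules = [
    Schedule(user_id=1, hour=6, minute=0, weekdays=EVERY_DAY),  # 6am every day
    Schedule(user_id=2, hour=7, minute=0, weekdays=WEEKDAYS),   # 7am on weekdays
]

# 2024-01-06 is a Saturday and 2024-01-08 is a Monday.
saturday_7am = due_schedules(schedules, datetime(2024, 1, 6, 7, 0))
monday_7am = due_schedules(schedules, datetime(2024, 1, 8, 7, 0))
saturday_6am = due_schedules(schedules, datetime(2024, 1, 6, 6, 0))
```

Because preferences are read at the moment the script runs, a change made by the user takes effect on the next run without any rescheduling step.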

Celery: Make sure workers are not running only jobs from one user

I have 4 celery workers each with concurrency of 6.
I have users submitting varying number of jobs (from 1 to 20).
How do I ensure that each user's jobs get equal processing time, and that one user's jobs do not fill up the queue, forcing other users' jobs to wait?
I am afraid that if the workers end up going through all the jobs submitted by the first user, the other users' queued jobs must wait for the first user to finish, which is an inconvenience.
Is there a way to make the Celery workers aware that one user's jobs are holding up other users' queued jobs? Alternatively, can I run a maximum of one job from each user at any given time?
I have one queue to which I submit all the users' jobs. Would I need to make a queue for each user and somehow use a round-robin strategy to pull one job from each user's queue?
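At the time this was written, Celery didn't support priority queues (more recent Celery releases document message priorities for the RabbitMQ and Redis brokers).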
Making a queue for each user and scheduling them based on round-robin algorithm seems to be a lot of work.
One simple way to solve your problem is to create a temporary table and store the incoming task details there. Send the first received task to Celery. By the time it has completed, you might have received a lot of tasks from various users. Now, based on the user id and on which tasks are completed and uncompleted, you can send the most appropriate task to Celery for execution.
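A minimal sketch of that selection logic is given below, with an in-memory structure standing in for the temporary table. The class name FairDispatcher and its methods are hypothetical; the job returned by next_job is what would be sent to Celery, and finished would be called when the task completes. The rule used here (at most one running job per user, users served in turn) is one possible reading of "most appropriate task".

```python
from collections import OrderedDict, deque


class FairDispatcher:
    """Holds pending jobs per user and releases at most one running job per user."""

    def __init__(self):
        self._pending = OrderedDict()  # user_id -> deque of waiting jobs
        self._running = set()          # users that currently have a job running

    def submit(self, user_id, job):
        self._pending.setdefault(user_id, deque()).append(job)

    def next_job(self):
        """Pick a job from the first waiting user with nothing running, or None."""
        for user_id, jobs in self._pending.items():
            if user_id in self._running:
                continue
            job = jobs.popleft()
            if jobs:
                # Move the user to the back so other users get their turn first.
                self._pending.move_to_end(user_id)
            else:
                del self._pending[user_id]
            self._running.add(user_id)
            return user_id, job  # return right after mutating the dict
        return None

    def finished(self, user_id):
        """Mark the user's running job as done so their next job can be released."""
        self._running.discard(user_id)


dispatcher = FairDispatcher()
for n in range(3):
    dispatcher.submit("alice", f"alice-job-{n}")
dispatcher.submit("bob", "bob-job-0")

a = dispatcher.next_job()  # alice submitted first, so she goes first
b = dispatcher.next_job()  # alice is busy, so bob's job is released
c = dispatcher.next_job()  # both users are busy: nothing to release
dispatcher.finished("alice")
d = dispatcher.next_job()  # alice's second job
```

With several workers the one-job-per-user limit could be relaxed to a small per-user cap; the point is only that the choice of the next job is made per user rather than in plain arrival order.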

Sending Reminders for Tasks

I have recently been thinking about a possible architecture for a simple task reminder system. A user will schedule a task, and a reminder in the form of an SMS, email, or Android notification needs to be sent to all stakeholders some x minutes before the task is scheduled to be performed (much the same way Google Calendar works). The problem here is to send the reminder at that precise point in time. Here are the two possible approaches I can think of:
Cron: I can set up a cron job to run every minute. This will scan the table for notifications which need to be sent in the next minute and simply send them. But precision is lost, as there is always the chance of a +/-1 minute error.
Work Queues: I can simply put a message with an appropriate delay in a queue at the time the task is scheduled. Workers will send the notification as and when they receive the message. I can add as many workers as I want in case my real-time behavior starts getting affected by load. There are still a few issues. How do I choose the appropriate work queue? I have evaluated RabbitMQ and Beanstalk. While RabbitMQ follows the standard AMQP protocol and is widely suggested, it doesn't provide delay functionality out of the box. There are ways to simulate this using dead-letter exchanges, but this will not work in my case because the delay needs to be variable. Beanstalk supports this, but the problem is that the Beanstalk queue resides entirely in memory, which I don't like (but can live with). Any possible alternatives?
Third Approach: ??????. I am sure a simple desktop notification tool does neither of the two. What technology do they use to achieve the same thing?
We had the same scenario, and we use Redis for long-term schedules; we currently hold reminders up to 2 years ahead. You can use a Sorted Set where the timestamp is the score.
We use Beanstalkd delayed jobs for reminders that we know are relatively short term (a couple of hours) and that will not be cancelled. To remove a delayed message from Beanstalkd you need to retain the job id in a database for later removal, and that is not viable.
Although you mention the memory limitation, we use persistence on both Redis and Beanstalkd.
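A minimal sketch of the sorted-set pattern follows. The three helper functions only call zadd, zrangebyscore, and zrem, written the way the redis-py client exposes them; so that the example runs without a Redis server, it uses a small in-memory stand-in (FakeSortedSetClient) that implements just those three calls. The key name, the helper names, and the stand-in class are all hypothetical.

```python
import time


class FakeSortedSetClient:
    """In-memory stand-in exposing the three sorted-set calls used below."""

    def __init__(self):
        self._sets = {}

    def zadd(self, name, mapping):
        self._sets.setdefault(name, {}).update(mapping)

    def zrangebyscore(self, name, min, max):
        members = self._sets.get(name, {})
        hits = [(score, m) for m, score in members.items() if min <= score <= max]
        return [m for score, m in sorted(hits)]

    def zrem(self, name, *values):
        members = self._sets.get(name, {})
        return sum(1 for v in values if members.pop(v, None) is not None)


REMINDERS_KEY = "reminders"  # hypothetical key name


def schedule_reminder(client, reminder_id, send_at):
    # Score = Unix timestamp at which the reminder should fire.
    client.zadd(REMINDERS_KEY, {reminder_id: send_at})


def cancel_reminder(client, reminder_id):
    # Cancelling only needs the member name, not a separately stored job id.
    client.zrem(REMINDERS_KEY, reminder_id)


def pop_due_reminders(client, now=None):
    """Fetch and remove every reminder whose timestamp is <= now."""
    now = time.time() if now is None else now
    due = client.zrangebyscore(REMINDERS_KEY, 0, now)
    # Keep only the reminders this caller removed, so two workers
    # polling at the same time do not both send the same one.
    return [r for r in due if client.zrem(REMINDERS_KEY, r) == 1]


client = FakeSortedSetClient()
schedule_reminder(client, "task-1", send_at=100)
schedule_reminder(client, "task-2", send_at=200)
schedule_reminder(client, "task-3", send_at=150)
cancel_reminder(client, "task-3")

due_early = pop_due_reminders(client, now=120)
due_late = pop_due_reminders(client, now=500)
due_again = pop_due_reminders(client, now=500)
```

A worker would call pop_due_reminders on a short interval. With a real redis-py client, members come back as bytes unless the client is created with decode_responses=True.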

Time based Sagas with Event Sourcing

Let's say I wanted to have a saga that gets created by some event, then sits and waits for a few hours, and if nothing happens, sends off some command.
Now, if this Saga was all in-memory and I had to restart the app/server, the saga would be unloaded and never seen again, right?
Would I use Event Sourcing to bring this Saga up to speed once the system is back online?
If so, I would need pretty much a separate Event Store with "active sagas" that can be replayed at system startup, to get my Sagas up to speed. So far it seems good to me, but how would I implement the timeout?
I would need some way of "faking" the timeouts at replay, taking into account there may be several, subsequent timeouts depending on the events going into the saga.
The best way to achieve this capability is with another endpoint that is capable of returning a message back to you at a certain point in time. For example, your saga may dispatch a message to this "timeout manager" and say wake me in 1 hour or 1 day or even 1 year. The message would then be returned to you at that time. Ideally this message would have business meaning that would cause an action to occur.
Perhaps the best example of this is something like customer signup where, if the customer hasn't confirmed their account within 7 days of signup, you'd notify them via email. The "timeout message" would effectively be: RemindUserToConfirmAccountMessage. When this message is received back by the saga after 7 days, the saga would determine, based upon its current state, whether that message needs to be handled and a customer email needs to be sent. But if the user has already confirmed their account, the message can be discarded with no action taken.
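A minimal sketch of that decision is shown below. Only the message name RemindUserToConfirmAccountMessage comes from the description above; the SignupSaga class, the UserConfirmedAccount message, and the SendReminderEmail command are hypothetical names, and messages are plain strings to keep the example short. The timeout manager itself is not modelled: the sketch starts at the point where the message is delivered back to the saga.

```python
from dataclasses import dataclass


@dataclass
class SignupSaga:
    user_id: str
    confirmed: bool = False

    def handle(self, message):
        """Apply a message to the saga and return the commands it emits."""
        if message == "UserConfirmedAccount":
            self.confirmed = True
            return []
        if message == "RemindUserToConfirmAccountMessage":
            if self.confirmed:
                # The timeout is no longer relevant: discard it.
                return []
            return [("SendReminderEmail", self.user_id)]
        return []


# Saga whose user never confirmed: the timeout leads to a reminder email.
pending = SignupSaga(user_id="u1")
pending_commands = pending.handle("RemindUserToConfirmAccountMessage")

# Saga whose user confirmed before the timeout came back: nothing happens.
done = SignupSaga(user_id="u2")
done.handle("UserConfirmedAccount")
done_commands = done.handle("RemindUserToConfirmAccountMessage")
```

Since the timeout arrives as an ordinary message, the saga does not need an in-memory timer of its own, and a restart only requires that its state (here, the confirmed flag) can be reloaded before the message is handled.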