Non-stop workers in Celery - Kubernetes

I'm creating a distributed web crawler that crawls multiple social media platforms at the same time. The system is designed to distribute the available resources across the platforms based on their current post rate.
For example, if social media 1 has 10 new posts per hour and social media 2 has 5 posts per hour, two crawlers focus on social media 1 and one crawler focuses on social media 2 (if we are only allowed three crawlers).
I have decided to implement this project with Celery, Flask, RabbitMQ, and Kubernetes as the resource manager.
I have some questions regarding the deployment:
How can I tell Celery to keep a fixed number of tasks in RabbitMQ? This crawler should never stop crawling; it should create a new task based on each social media's post rate (gathered from previous crawling data). The problem is that I don't have a task submitter for this process. Usually there is a task submitter that pushes tasks to Celery, but there is no such thing in this project. We have a list of social media sites and the number of workers they need (stored in Postgres), and we need Celery to put a new task in RabbitMQ as soon as a task finishes.
I have tried submitting a task at the end of every job (crawling process), but this approach has a problem and does not scale: the submitted job ends up last in the RabbitMQ queue.
I need a system that manages the free workers and assigns tasks to them immediately. The system should check the free and busy workers and the post rates in the database, then hand a task to a worker. I think RabbitMQ (or even Redis) might not be a good fit, because they are message brokers that assign a queued task to a worker; here I don't want a queue, I want to start a task immediately when a free worker is found. The main reason queueing doesn't fit is that the task should be decided when the job is starting, not before that.

My insights on your problem:
"I need a system to manage the free workers and assign tasks to them immediately."
-- Celery does this job for you.
"The system I want should check the free and busy workers and database post rates and give a task to the worker."
-- Celery is a task distribution system; it will distribute the tasks as you expect.
"I think using RabbitMQ (or even Redis) might not be good because they are message brokers which assign a worker to a task in the queue."
-- With Celery you definitely need a broker. The broker just holds your messages; Celery polls the queues and distributes them to the right workers (with priorities, timeouts, soft handling, retries).
"but here, I don't want to have a queue; I want to start a task immediately when a free worker is found. The main reason queueing is not good is that the task should be decided when the job is starting, not before that."
-- This is a kind of chain reaction, like triggering a new job once the previous one is done. If that is the case, you don't even need Celery or a distributed producer-consumer system.
Identify the problem:
1. Do you need a periodic task to be executed at a point in time? Go with a cron job or celery-beat (a cron-based Celery scheduler).
2. Do you require multiple tasks to be executed without blocking the other running tasks? You need a producer-consumer system (Celery as an out-of-the-box solution, or native Python consumers for RabbitMQ/Redis).
3. If the same task should trigger the new task, there is no need for multiple workers: what would you gain from multiple workers if your work is just a single thread? (A minimal sketch of this pattern follows the outcome line below.)
Outcome -- [Celery, RabbitMQ, and Kubernetes: a good combo for a distributed, orchestrated system] or [a webhook model] or [a recursive Python script].
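For case 3, a single self-triggering task is enough. A minimal sketch, assuming a hypothetical crawl task; the 60-second re-trigger delay is illustrative and could instead be derived from the site's observed post rate:

from celery import Celery

app = Celery('crawler', broker='amqp://guest@localhost//')

@app.task
def crawl(site_id):
    # ... crawl the site and store the results ...
    # Re-enqueue itself once done; the countdown could be computed
    # from the site's post rate instead of being a constant.
    crawl.apply_async(args=[site_id], countdown=60)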
Reply to your comment below, @alavi:
One way of doing it: write a periodic job (it can run every second, every minute, every hour, or at whatever rate) using celery beat, which acts as a producer or parent task. It iterates over all media sites from the DB and spawns a new crawling task for each. The work status can be maintained in the DB and, based on that status, new tasks can be spawned. For a start, this parent task can check whether the previous job is still running, or check the progress of the last task, and decide based on that progress; we could even think about splitting the crawl job into micro-tasks that are triggered from the parent job. You can work out more such details (the x's and y's) as you go, during development or based on performance.
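A sketch of that parent/producer pattern. The fetch_site_allocations helper, the module name tasks.py, and the 10-second beat rate are illustrative assumptions, not part of the original answer:

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

def fetch_site_allocations():
    """Hypothetical DB query returning (site_id, workers_needed, currently_running) rows."""
    return [(1, 2, 1), (2, 1, 0)]  # placeholder data

@app.task
def crawl(site_id):
    # ... crawl the site and write post-rate stats back to the DB ...
    pass

@app.task
def dispatch_crawls():
    # Parent task: top each site up to its desired number of crawlers.
    for site_id, needed, running in fetch_site_allocations():
        for _ in range(max(0, needed - running)):
            crawl.delay(site_id)

# Assuming this module is tasks.py, so the task is named 'tasks.dispatch_crawls'.
app.conf.beat_schedule = {
    'dispatch-crawls': {'task': 'tasks.dispatch_crawls', 'schedule': 10.0},
}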

Related

AWS ECS. How to ensure only one instance of a task is running?

I want to set up an ECS task to schedule various other application tasks.
The "tasks" this task will schedule will mostly involve calling RESTful endpoints in another load-balanced service.
I know there are other ways to do this, such as using CloudWatch to trigger a Lambda, but that seems overly complex for what I need.
I was planning to just make a very simple, lightweight Alpine-based image with a crontab to do the triggering of the RESTful calls.
This all seems easy enough. The only concern I have is that I would want to prevent, as far as possible, having multiple instances of this task running, even if only for a short period of time.
If my CI/CD pipeline triggers an update to this cron task, there may be a short period where the old and new tasks are running simultaneously.
There may therefore be a small chance that a cron job could be triggered twice.
What I would like to do is have ECS stop the currently running task completely before attempting to start the new one.
This seems contrary to the normal way it wants to work, where it ensures the new task is up and healthy before stopping the old one.
Is this possible, and if so, how do I configure it?
It's not a problem if my crons don't run for a period of time, but it could be a problem if any get triggered more than once.
Instead of using an ECS Service (which makes sure a particular number of tasks is always running and deploys via a rolling or blue/green strategy, which is not what you desire), how about using the StopTask and RunTask APIs to control when a task is stopped and started? That gives you complete control.
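A rough sketch of that approach with boto3; the cluster and task definition family names are illustrative:

import boto3

ecs = boto3.client('ecs')

CLUSTER = 'my-cluster'    # illustrative cluster name
FAMILY = 'cron-runner'    # illustrative task definition family

# Stop any currently running instance of the task first...
running = ecs.list_tasks(cluster=CLUSTER, family=FAMILY)['taskArns']
for arn in running:
    ecs.stop_task(cluster=CLUSTER, task=arn, reason='replaced by new deployment')

# ...then start exactly one new task from the latest task definition revision.
ecs.run_task(cluster=CLUSTER, taskDefinition=FAMILY, count=1)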
Instead of using scheduled tasks, you could create an ECS service and use scheduled scaling to scale the service's desired count to 1 and back down to zero.
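A sketch of that scheduled-scaling idea via the Application Auto Scaling API; the resource id and cron expressions are illustrative:

import boto3

aas = boto3.client('application-autoscaling')
RESOURCE = 'service/my-cluster/cron-runner'   # illustrative cluster/service

aas.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId=RESOURCE,
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=0,
    MaxCapacity=1,
)

# Scale up to one task at 06:00 UTC...
aas.put_scheduled_action(
    ServiceNamespace='ecs',
    ScheduledActionName='scale-up',
    ResourceId=RESOURCE,
    ScalableDimension='ecs:service:DesiredCount',
    Schedule='cron(0 6 * * ? *)',
    ScalableTargetAction={'MinCapacity': 1, 'MaxCapacity': 1},
)

# ...and back down to zero shortly after.
aas.put_scheduled_action(
    ServiceNamespace='ecs',
    ScheduledActionName='scale-down',
    ResourceId=RESOURCE,
    ScalableDimension='ecs:service:DesiredCount',
    Schedule='cron(15 6 * * ? *)',
    ScalableTargetAction={'MinCapacity': 0, 'MaxCapacity': 0},
)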

Architecting a configurable user notification service

I am building an application that needs to send notifications to users at a fixed time of day. Users can choose which time of day they would like to be notified and on which days. For example, a user might like to be notified at 6am every day, or at 7am only on weekdays.
On the back-end, I am unsure how to architect the service that sends these notifications. The solution needs to handle:
concurrency, so I can scale my servers (notifications should not be duplicated)
system restarts
if a user changes their preferences, pending notifications should be rescheduled
Using a message broker such as RabbitMQ and a task scheduler such as Celery may meet your requirements.
Asynchronous, or non-blocking, processing is a method of separating the execution of certain tasks from the main flow of a program. This provides you with several advantages, including allowing your user-facing code to run without interruption.
Message passing is a method which program components can use to communicate and exchange information. It can be implemented synchronously or asynchronously and can allow discrete processes to communicate without problems. Message passing is often implemented as an alternative to traditional databases for this type of usage because message queues often implement additional features, provide increased performance, and can reside completely in-memory.
Celery is a task queue that is built on an asynchronous message-passing system. It can be used as a bucket where programming tasks can be dumped. The program that passed the task can continue to execute and function responsively, and later on it can poll Celery to see if the computation is complete and retrieve the data.
While Celery is written in Python, its protocol can be implemented in any language: a Celery worker is just a program that connects to the broker to process messages, so if your language has an AMQP client, there shouldn't be much work to create a worker in it.
Also, there's another way to be language-independent: use REST tasks. Instead of your tasks being functions, they're URLs. With this approach you can even create simple web servers that enable preloading of code: simply expose an endpoint that performs an operation, and create a task that just performs an HTTP request to that endpoint.
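A minimal sketch of such a REST task; the endpoint URL, retry policy, and broker address are illustrative assumptions:

import requests
from celery import Celery

app = Celery('notifier', broker='amqp://guest@localhost//')

@app.task(bind=True, max_retries=3)
def call_endpoint(self, url, payload):
    # The real work lives behind the URL; this task only triggers it.
    try:
        response = requests.post(url, json=payload, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        raise self.retry(exc=exc, countdown=30)

# Usage (illustrative): call_endpoint.delay('https://internal/notify', {'user': 1})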
Here is the Python example from the official documentation:
from celery import Celery
from celery.schedules import crontab

app = Celery()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Calls test('hello') every 10 seconds.
    sender.add_periodic_task(10.0, test.s('hello'), name='add every 10')

    # Calls test('world') every 30 seconds.
    sender.add_periodic_task(30.0, test.s('world'), expires=10)

    # Executes every Monday morning at 7:30 a.m.
    sender.add_periodic_task(
        crontab(hour=7, minute=30, day_of_week=1),
        test.s('Happy Mondays!'),
    )

@app.task
def test(arg):
    print(arg)
As I see it, you need three types of entities: users (to store an email address or some other way to reach the user), notifications (to store what you want to send to the user: text, etc.) and schedules (to store when the user wants to get notifications). You need to store entities of those types in some kind of database.
A schedule should be connected to a user; a notification should be connected to a user and a schedule.
Assume you have a cron job that starts some script every minute. This script will try to get all notifications connected with schedules for the current time (the job's start time). Don't forget to implement some kind of overlapping prevention.
The script will then place tasks (with all the needed data: type of notification, users you want to notify, etc.) in a queue (beanstalkd or something similar). You can create as many workers (even on different physical instances) as you want to serve this queue, without worrying about duplication; this gives you great scalability.
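A sketch of that per-minute script; the fetch_due_notifications helper is a hypothetical DB query, and greenstalk stands in for any beanstalkd client:

import json
from datetime import datetime, timezone

import greenstalk  # beanstalkd client; any queue client would do

def fetch_due_notifications(now):
    """Hypothetical DB query: notifications whose schedule matches `now`."""
    return []  # e.g. rows of (user_id, channel, message)

def main():
    # Truncate to the minute so the query matches the schedule granularity.
    now = datetime.now(timezone.utc).replace(second=0, microsecond=0)
    # Overlap prevention (e.g. a lock or a "claimed" flag in the DB) goes here.
    queue = greenstalk.Client(('127.0.0.1', 11300))
    for user_id, channel, message in fetch_due_notifications(now):
        queue.put(json.dumps({'user': user_id, 'channel': channel, 'message': message}))

if __name__ == '__main__':
    main()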
If a user changes their schedule, it affects all of their notifications at the same moment. There are no pending notifications to reschedule, since notifications are only enqueued when they are actually due to be sent.
This is a very high-level description. Many things depend on the language, database(s), queue server, and worker implementation.

Celery: Make sure workers are not running only jobs from one user

I have 4 celery workers each with concurrency of 6.
I have users submitting a varying number of jobs (from 1 to 20).
How do I ensure that each user's jobs get equal processing time, and that one user's jobs do not fill up the queue, forcing other users' jobs to wait?
I am afraid that the workers will end up going through all the jobs submitted by the first user, so the other users' queued jobs must wait for the first user to finish, which is an inconvenience.
Is there a way to make the Celery workers aware of one user's jobs holding up other users' queued jobs? Alternatively, can I run at most one job from each user at any given time?
At the moment I have one queue that I submit all users' jobs to. Would I need to make a queue for each user and somehow use a round-robin strategy to pull one job from each user's queue?
At the moment Celery doesn't support priority queues.
Making a queue for each user and scheduling them with a round-robin algorithm seems like a lot of work.
One simple way to solve your problem is to create a temporary table and store the incoming task details in it. Send the first received task to Celery. By the time it completes, you might have received many more tasks from various users. Then, based on the user IDs and the completed and uncompleted tasks, you can send the most appropriate task to Celery for execution.
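A sketch of that dispatcher idea, using an in-memory stand-in for the temporary table and rotating fairly over users (all names are illustrative; in practice the state would live in a real table):

from collections import defaultdict, deque

pending = defaultdict(deque)   # user_id -> FIFO of that user's job payloads
rotation = deque()             # users currently holding pending jobs, in round-robin order

def submit(user_id, payload):
    # Invariant: a user is in `rotation` exactly when their queue is non-empty.
    if not pending[user_id]:
        rotation.append(user_id)
    pending[user_id].append(payload)

def next_job():
    """Pick the next user's oldest job, then move that user to the back."""
    while rotation:
        user_id = rotation.popleft()
        payload = pending[user_id].popleft()
        if pending[user_id]:
            rotation.append(user_id)
        return user_id, payload
    return None

# A single dispatcher process would call next_job() whenever Celery has a
# free slot and hand the payload to a Celery task, so no user monopolizes
# the workers.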

Sending Reminders for Tasks

I have recently been thinking about a possible architecture for a simple task reminder system. A user schedules a task, and a reminder in the form of SMS/email/Android notification needs to be sent to all stakeholders x minutes before the task is scheduled to be performed (much the same way Google Calendar works). The problem here is sending the reminder at that precise point in time. Here are the two possible approaches I can think of:
Cron: I can set up a cron job to run every minute. It will scan the table for notifications that need to be sent in the next minute and simply send them. But precision is lost, as there is always the chance of a +/- 1 minute error.
Work queues: I can simply put a message with the appropriate delay in a queue when the task is scheduled. Workers will send the notification as and when they receive the message. I can add as many workers as I want in case my real-time behavior starts being affected by load. There are still a few issues: how to choose the appropriate work queue? I have evaluated RabbitMQ and Beanstalkd. While RabbitMQ follows the standard AMQP protocol and is widely suggested, it doesn't provide delay functionality out of the box. There are ways to simulate this using dead-letter exchanges, but that will not work in my case because the delay needs to be variable. Beanstalkd supports this, but the problem is that the Beanstalkd queue resides entirely in memory, which I don't like (but can live with). Any possible alternatives?
Third approach: ??????. I am sure a simple desktop notification tool does neither of the two. What technology do they use to achieve the same thing?
We had the same scenario, and we use Redis for long schedules; we currently have reminders scheduled up to 2 years out. You can use a Sorted Set where the timestamp is the score.
We use Beanstalkd delayed jobs for the kind of reminders that we know are relatively short-term (a couple of hours) and have no cancellations: removing a delayed message from Beanstalkd requires retaining the job id in a database for later removal, and that is not viable.
Although you mention the memory limit, we use persistence on both Redis and Beanstalkd.
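A minimal sketch of the Redis sorted-set pattern with redis-py; the key name and payload format are illustrative:

import json
import time

import redis

r = redis.Redis()

def schedule(member, due_ts):
    # Score is the due unix timestamp; member is the serialized reminder.
    r.zadd('reminders', {member: due_ts})

def cancel(member):
    # Cancelling (or rescheduling) is just removing the same member.
    r.zrem('reminders', member)

def poll_due():
    # Everything with a score <= now is due.
    for raw in r.zrangebyscore('reminders', 0, time.time()):
        # Claim-and-remove; use ZPOPMIN or a Lua script for atomicity
        # if several pollers run concurrently.
        if r.zrem('reminders', raw):
            yield json.loads(raw)

# Usage (illustrative): remind in one hour.
schedule(json.dumps({'user': 1, 'message': 'stand-up'}), time.time() + 3600)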

Queue suggestions for deferred execution for a one-off task

I'm looking for a lightweight system that will let me queue up a one-off (non-recurring) task and have it execute at a specific time in the future.
This is for the backend of a game where the user does tasks that are time-based. I need the server to check the status of the user's "job" at the completion time and perform the necessary housekeeping on their game state.
I'm somewhat familiar with Redis, Celery, Beanstalkd, ZeroMQ, et al., but I haven't found any info on scheduling a single unit of work to be executed in the future (or popped off the queue at a set time). Celery beat has a scheduler for cron-type recurring tasks, but I didn't see anything for one-off tasks.
I've also seen the "at" command in *nix, but I'm not aware of any frontend for it that can help me manage the jobs.
I realize there are some easy solutions such as ordering keys in Redis and doing a blocking pop, but I'd like to not have to continuously poll a queue to see if the next job is ready.
The closest I've found is the deferred library on GAE, but I was hoping for something that runs on my own Linux box along with my other components.
I'd appreciate any suggestions!
Celery allows you to specify a countdown or an ETA when calling a task, so it executes at a specific time in the future.
The documentation says it best:
http://docs.celeryproject.org/en/latest/userguide/calling.html#eta-and-countdown
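A short, self-contained sketch of both options; the task body, broker address, and delays are illustrative:

from datetime import datetime, timedelta, timezone

from celery import Celery

app = Celery('game', broker='amqp://guest@localhost//')

@app.task
def check_job_status(job_id):
    # Inspect the user's finished "job" and update their game state.
    pass

# Run 60 seconds from now...
check_job_status.apply_async(args=[42], countdown=60)

# ...or at an exact point in the future (ETA should be a timezone-aware UTC datetime).
check_job_status.apply_async(args=[42], eta=datetime.now(timezone.utc) + timedelta(hours=2))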