Celery beat task spawns a new process with billiard and sets it daemonic, but the daemonic flag is not working - celery

billiard version 3.5.0.5
celery 4.0.2
Steps to reproduce
I want to control long-running beat tasks: if the previous run has not finished, do not start a new process.
import time
from functools import wraps

from billiard import Process  # billiard, not multiprocessing (see "Expected behavior" below)

# logger and get_redis_conn() are project helpers defined elsewhere.


def lock_beat_task(gap_time, flush_time):
    def decorate(func):
        """
        Prevents a periodic (beat) task from being started again while the previous run is still executing.
        problem: Celery beat behaves like crontab; it fires on schedule regardless of whether the previous run has finished.
        approach: A cache key named after the task is checked on every schedule; if the previous run is still alive the new run is skipped, and the key is deleted once the run finishes.
        :param func:
        :return:
        """
        @wraps(func)
        def wrapper(*args, **kwargs):
            key_name = func.__name__
            logger.info('++++++++{} enter +++++++.'.format(key_name))
            monitor = BeatWokerMonitor(key_name, gap_time, flush_time)
            mo_process = monitor.monitor()
            if mo_process:
                try:
                    logger.info('++++++++{} is running.++++++++'.format(key_name))
                    func(*args, **kwargs)
                    logger.info('{} ended gracefully.'.format(key_name))
                except KeyboardInterrupt:
                    monitor.logger.info('{} KeyboardInterrupt, reset succeeded'.format(key_name))
                finally:
                    monitor.reset()
                    # mo_process.join()
            else:
                logger.info('{} is running or gap time is not over.'.format(key_name))
            logger.info('{} is end!---.'.format(key_name))
        return wrapper
    return decorate
class BeatWokerMonitor(object):
    """
    Maintains/monitors the health of a beat worker.
    monitor() decides whether the task should run: if there is no key_name in redis, or the
    difference between the current time and the timestamp stored at key_name is greater than
    gap_time, it starts a monitor daemon process and returns it (truthy, i.e. the task should run).
    That daemon refreshes the timestamp stored at key_name every flush_time seconds.
    Otherwise monitor() returns None.
    """
    def __init__(self, key_name, gap_time, flush_time):
        """
        All times are in seconds.
        :param key_name:
        :param gap_time:
        :param flush_time:
        """
        self.key_name = key_name
        self.gap_time = gap_time
        self.flush_time = flush_time
        self.db_redis = get_redis_conn(11)
        self.logger = logger  # reuse the module-level logger (also accessed as monitor.logger in the decorator)

    def start_monitor(self):
        flush_key_process = Process(target=self.flush_redis_key_gap_time,
                                    name='{}_monitor'.format(self.key_name),
                                    daemon=True)
        flush_key_process.start()
        return flush_key_process
    def monitor(self):
        rd_key_value = self.get_float_rd_key_value()
        if not rd_key_value:
            v = time.time()
            self.db_redis.set(self.key_name, v)
            return self.start_monitor()
        if time.time() - rd_key_value > self.gap_time:
            return self.start_monitor()

    def get_float_rd_key_value(self):
        value = self.db_redis.get(self.key_name)
        if value:
            return float(value)
        else:
            return 0

    def flush_redis_key_gap_time(self):
        old_time = self.get_float_rd_key_value()
        self.logger.info('{} start flush, now is {}'.format(self.key_name, old_time))
        while 1:
            if time.time() - old_time > self.flush_time:
                now = time.time()
                self.db_redis.set(self.key_name, now)
                old_time = now
                self.logger.info('{} monitor flush time {}'.format(self.key_name, now))
            else:
                self.logger.info('{} not flush {} , {}'.format(self.key_name, time.time() - old_time, self.flush_time))

    def reset(self):
        self.db_redis.delete(self.key_name)
And the task code. You can write a short-running task to test this:
@app.task
@lock_beat_task(5 * 60, 10 * 3)
@send_update_msg("update")
def beat_update_all():
    """
    :return:
    """
    from crontab.update import EAllTicket
    eall = EAllTicket()
    send_task_nums = eall.run()
    return send_task_nums
Expected behavior
I want the monitor to run in the background. I cannot use multiprocessing to create the child process from inside the worker, so I use billiard instead.
I expect that when beat_update_all() finishes, the monitor process kills itself (it is daemonic).
Actual behavior
beat_update_all() finishes, but the monitor process is still running.
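A possible workaround (my sketch, not part of the original report): instead of relying on the daemon flag, keep the handle to the monitor process and terminate it explicitly once the task body returns. billiard's Process exposes the same terminate()/join() API as multiprocessing, so the wrapper above could end like this:

        # Hedged sketch: stop the flusher explicitly instead of relying on daemon=True.
        def wrapper(*args, **kwargs):
            key_name = func.__name__
            monitor = BeatWokerMonitor(key_name, gap_time, flush_time)
            mo_process = monitor.monitor()
            if not mo_process:
                logger.info('{} is running or gap time is not over.'.format(key_name))
                return
            try:
                func(*args, **kwargs)
            finally:
                monitor.reset()
                mo_process.terminate()  # kill the flush loop
                mo_process.join()       # reap it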

Related

Setting Celery Task attributes (i.e time_limit and soft_time_limit) does not work

According to this thread, the problem was resolved, but it seems that it is not.
Setting Time Limit on specific task with celery
My current Celery version is 3.1.18 (Cipater).
I am trying to overwrite the default settings of a Task. The objective is to change the soft_time_limit and hard time_limit of the task, because the same task is used for multiple purposes.
I pass soft_time_limit and time_limit to the MyTask constructor to change the default settings.
# celery/app/task.py
class MyTask(task.Task):
    time_limit = 100
    soft_time_limit = 110
    max_retries = 0

    def __init__(self, time_limit=None, soft_time_limit=None,
                 max_retries=None, *args, **kwargs):
        if time_limit:
            self.time_limit = time_limit
        if soft_time_limit:
            self.soft_time_limit = soft_time_limit
        if max_retries:
            self.max_retries = max_retries
        task.Task.__init__(self, *args, **kwargs)
t1 = MyTask(time_limit=30, soft_time_limit=20, max_retries=5)

or

t1 = MyTask()
t1.time_limit = 30
t1.soft_time_limit = 20

Then I pass t1.si() to task.RetryableChain(...):

job = task.RetryableChain(...)
job.delay()
When the run method is called by the worker, it still sees the old value (time_limit = 100), whereas I have set time_limit = 30.
Please let me know if the issue still exists in version 3.1.18.
I had to patch the Celery code to make it work. This is definitely a temporary fix, but it works. I am not sure why, when the attributes are set to new values, they are not transferred to worker.job. My guess is that calling task.si() or s() creates a Signature instance which does not hold these time_limit attributes, so the worker falls back to the original values stored on the class. Just a thought.
t1 = MyTask()
kwargs = {}
kwargs['time_limit'] = 30
kwargs['soft_time_limit'] = 40
t1.s(**kwargs)
---->>>
# celery/worker/job.py
def execute_using_pool(self, pool, **kwargs):
    """Used by the worker to send this task to the pool.

    :param pool: A :class:`celery.concurrency.base.TaskPool` instance.
    :raises celery.exceptions.TaskRevokedError: if the task was revoked
        and ignored.
    """
    uuid = self.id
    task = self.task
    if self.revoked():
        raise TaskRevokedError(uuid)

    hostname = self.hostname
    kwargs = self.kwargs
    if task.accept_magic_kwargs:
        kwargs = self.extend_with_default_kwargs()

    request = self.request_dict
    request.update({'hostname': hostname, 'is_eager': False,
                    'delivery_info': self.delivery_info,
                    'group': self.request_dict.get('taskset')})
    timeout, soft_timeout = request.get('timelimit', (None, None))
    # timeout = timeout or task.time_limit
    # soft_timeout = soft_timeout or task.soft_time_limit
    # SKAR: request.get('timelimit') always returns the original values stored on the Task,
    # so read the per-call overrides from the task kwargs instead.
    timeout = kwargs.get('time_limit', task.time_limit)
    soft_timeout = kwargs.get('soft_time_limit', task.soft_time_limit)
    result = pool.apply_async(
        trace_task_ret,
        args=(self.name, uuid, self.args, kwargs, request),
        accept_callback=self.on_accepted,
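As an alternative to patching the worker (my note, not part of the original question): newer Celery versions accept per-call limits as apply_async execution options, which travel in the message's timelimit field; whether 3.1.18 honours them is worth verifying. A minimal sketch, assuming a plain @app.task named my_task:

    # Hedged sketch: per-call time limits via apply_async options.
    result = my_task.apply_async(
        args=(1, 2),
        time_limit=30,       # hard limit, seconds, for this call only
        soft_time_limit=20,  # soft limit, seconds, for this call only
    )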

Schedule Celery task to run after other task(s) complete

I want to accomplish something like this:
results = []
for i in range(N):
    data = generate_data_slowly()
    res = tasks.process_data.apply_async(data)
    results.append(res)

celery.collect(results).then(tasks.combine_processed_data())
i.e. launch asynchronous tasks over a long period of time, then schedule a dependent task that will only be executed once all the earlier tasks are complete.
I've looked at things like chain and chord, but it seems like they only work if you can construct your task graph completely upfront.
For anyone interested, I ended up using this snippet:
@app.task(bind=True, max_retries=None)
def wait_for(self, task_id_or_ids):
    try:
        ready = app.AsyncResult(task_id_or_ids).ready()
    except TypeError:
        ready = all(app.AsyncResult(task_id).ready()
                    for task_id in task_id_or_ids)
    if not ready:
        self.retry(countdown=2 ** self.request.retries)
And writing the workflow something like this:
task_ids = []
for i in range(N):
    task = (generate_data_slowly.si(i) |
            process_data.si(i))
    task_id = task.delay().task_id
    task_ids.append(task_id)

final_task = (wait_for.si(task_ids) |
              combine_processed_data.si())
final_task.delay()
That way you would be running your tasks synchronously.
The solution depends entirely on how and where data are collected. Roughly, given that generate_data_slowly and tasks.process_data run one after the other, a better approach would be to join both in one task (or a chain) and to group them.
chord will allow you to add a callback to that group.
The simplest example would be:
from celery import chord

@app.task
def getnprocess_data():
    data = generate_data_slowly()
    return whatever_process_data_does(data)

header = [getnprocess_data.s() for i in range(N)]
callback = combine_processed_data.s()

chord(header)(callback).get()

Running concurrent mongoengine queries with asyncio

I have 4 functions that basically build queries and execute them. I want to run them concurrently using asyncio. My asyncio setup seems correct, since non-MongoDB tasks (for example asyncio.sleep()) run as they should. Here is the code:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
tasks = [
    service.async_get_associate_opportunity_count_by_user(me, criteria),
    service.get_new_associate_opportunity_count_by_user(me, criteria),
    service.async_get_associate_favorites_count(me, criteria=dict()),
    service.get_group_matched_opportunities_count_by_user(me, criteria)
]
available, new, favorites, group_matched = loop.run_until_complete(asyncio.gather(*tasks))
stats['opportunities']['available'] = available
stats['opportunities']['new'] = new
stats['opportunities']['favorites'] = favorites
stats['opportunities']['group_matched'] = group_matched
loop.close()
# functions written in another file

@asyncio.coroutine
def async_get_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = ...  # query that gets built by some other functions
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of available: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def get_new_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = ...  # query that gets built by some other functions
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of new: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def async_get_associate_favorites_count(self, user, criteria={}, **kwargs):
    start_time = time.time()
    query = ...  # query that gets built by some other functions
    favorites = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of favorites: {}".format(run_time))
    yield from asyncio.sleep(2)
    return favorites

@asyncio.coroutine
def get_group_matched_opportunities_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = ...  # query that gets built by some other functions
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of group matched: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities
The yield from asyncio.sleep(2) is just to show that the functions run asynchronously. Here is the output on the terminal:
runtime of group matched: 0.11431598663330078
runtime of favorites: 0.0029871463775634766
Timestamp function run time: 0.0004897117614746094
runtime of new: 0.15225648880004883
runtime of available: 0.13006806373596191
total run time: 2403.2700061798096
From my understanding, apart from the 2000 ms added to the total run time by the sleep call, the rest should not be more than about 155-160 ms, since that is the longest run time among all the functions.
I'm currently looking into motorengine (a port of mongoengine 0.9.0), which apparently enables asynchronous MongoDB queries, but I think I won't be able to use it since my models have been defined using mongoengine. Is there a workaround for this problem?
The reason your queries aren't running in parallel is that whenever you run Opportunity.objects(query).count() in your coroutines, the entire event loop blocks, because those methods do blocking IO.
So you need a MongoDB driver that can do async/non-blocking IO. You are on the right path with motorengine, but as far as I can tell it is written for the Tornado asynchronous framework. To get it to work with asyncio you would have to hook up Tornado and asyncio; see http://tornado.readthedocs.org/en/latest/asyncio.html for how to do that.
Another option is to use asyncio-mongo, but it doesn't have a mongoengine-compatible ORM, so you might have to rewrite most of your code.
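A further workaround sketch (my addition, not from the original answer): if you want to stay on mongoengine, you can push the blocking count() calls into a thread pool with loop.run_in_executor, so they at least overlap while waiting on MongoDB. This assumes the Opportunity model and query objects from the question:

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor(max_workers=4)

    def count_opportunities(query):
        # Plain blocking mongoengine call; runs inside a worker thread.
        return Opportunity.objects(query).count()

    @asyncio.coroutine
    def async_count(loop, query):
        return (yield from loop.run_in_executor(executor, count_opportunities, query))

    # usage, mirroring the question's code:
    # tasks = [async_count(loop, q) for q in queries]
    # results = loop.run_until_complete(asyncio.gather(*tasks))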

Avoiding duplicate tasks in celery broker

I want to create the following flow using the Celery configuration/API:
Send TaskA(argB) only if the Celery queue has no TaskA(argB) already pending.
Is it possible? how?
You can make your job aware of other tasks by some sort of memoization. If you use a cache control key (redis, memcached, /tmp, whatever is handy), you can make execution depend on that key. I'm using redis as an example.
from redis import Redis

@app.task
def run_only_one_instance(params):
    try:
        sentinel = Redis().incr("run_only_one_instance_sentinel")
        if sentinel == 1:
            # I am the legitimate running task
            perform_task()
        else:
            # Do you want to do something else on task duplicate?
            pass
        Redis().decr("run_only_one_instance_sentinel")
    except Exception as e:
        Redis().decr("run_only_one_instance_sentinel")
        # potentially log error with Sentry?
        # decrement the counter to ensure tasks can run
        # or: raise e
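A variation on the same idea (my sketch, not part of the answer above): a single Redis key set with NX and an expiry avoids the counter getting stuck if a worker dies before decrementing. It assumes redis-py and the same perform_task() placeholder:

    from redis import Redis

    redis_client = Redis()

    @app.task
    def run_only_one_instance(params):
        # SET ... NX EX: succeeds only if the key does not exist, and the key
        # expires on its own if the worker crashes, so duplicates are never
        # blocked forever.
        acquired = redis_client.set("run_only_one_instance_lock", "1", nx=True, ex=600)
        if not acquired:
            return  # a duplicate is already pending or running
        try:
            perform_task()
        finally:
            redis_client.delete("run_only_one_instance_lock")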
I cannot think of a way other than to:
1) retrieve all executing and scheduled tasks via celery inspect, and
2) iterate through them to see if your task is there.
Check this SO question to see how the first point is done.
Good luck.
I don't know whether this will help you more than the other answers, but here is my approach, following the same idea given by srj. I needed a way to prevent my server from launching tasks with the same id into the queue, so I wrote a general helper function.
def is_task_active_or_registered(app, task_id):
    i = app.control.inspect()
    active_dict = i.active()
    scheduled_dict = i.scheduled()
    keys_set = set(active_dict.keys()) | set(scheduled_dict.keys())
    tasks_ids_set = set()
    for _dict in [active_dict, scheduled_dict]:
        for k in keys_set:
            for task in _dict.get(k, []):
                tasks_ids_set.add(task['id'])
    if task_id in tasks_ids_set:
        return True
    else:
        return False
So, I use it like this:
In the context where my Celery app object is available, I define:

def check_task_can_not_run(task_id):
    return is_task_active_or_registered(app=celery, task_id=task_id)

Then, from my client request, I call check_task_can_not_run(...) and block the task from being launched whenever it returns True.
I was facing a similar problem: beat was putting duplicates into my queue. I wanted to use expires, but this feature isn't working properly (https://github.com/celery/celery/issues/4300).
So here is a scheduler which checks whether the task has already been enqueued (based on the task name).
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

import json
from heapq import heappop, heappush

from celery.beat import event_t
from celery.schedules import schedstate
from django_celery_beat.schedulers import DatabaseScheduler
from typing import List, Optional
from typing import TYPE_CHECKING

from your_project import celery_app

if TYPE_CHECKING:
    from celery.beat import ScheduleEntry


def is_task_in_queue(task, queue_name=None):
    # type: (str, Optional[str]) -> bool
    queues = [queue_name] if queue_name else celery_app.amqp.queues.keys()
    for queue in queues:
        if task in get_celery_queue_tasks(queue):
            return True
    return False


def get_celery_queue_tasks(queue_name):
    # type: (str) -> List[str]
    with celery_app.pool.acquire(block=True) as conn:
        tasks = conn.default_channel.client.lrange(queue_name, 0, -1)

    decoded_tasks = []
    for task in tasks:
        j = json.loads(task)
        task = j['headers']['task']
        if task not in decoded_tasks:
            decoded_tasks.append(task)
    return decoded_tasks


class SmartScheduler(DatabaseScheduler):
    """
    Smart in the sense that it prevents duplicating tasks in the queues.
    """
    def is_due(self, entry):
        # type: (ScheduleEntry) -> schedstate
        is_due, next_time_to_run = entry.is_due()

        if (
            not is_due or  # a duplicate wouldn't be created anyway
            not is_task_in_queue(entry.task)  # not in the queue, so let it run
        ):
            return schedstate(is_due, next_time_to_run)

        # The task should run (is_due) but is already present in the queue
        # (is_task_in_queue), so skip this run and push the next one onto the heap.
        H = self._heap
        if not H:
            return schedstate(False, self.max_interval)

        event = H[0]
        verify = heappop(H)
        if verify is event:
            next_entry = self.reserve(entry)
            heappush(H, event_t(self._when(next_entry, next_time_to_run),
                                event[1], next_entry))
        else:
            heappush(H, verify)
            next_time_to_run = min(verify[0], next_time_to_run)

        return schedstate(False, min(next_time_to_run, self.max_interval))
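To actually use it (my note, not part of the original answer), beat has to be pointed at the custom scheduler class. A sketch, assuming the class above lives in a module called your_project.schedulers:

    # Hedged sketch: register the custom scheduler with beat.
    app.conf.beat_scheduler = "your_project.schedulers:SmartScheduler"

    # or equivalently on the command line:
    #   celery -A your_project beat --scheduler your_project.schedulers:SmartScheduler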

Replacing Celerybeat with Chronos

How mature is Chronos? Is it a viable alternative to a scheduler like celery beat?
Right now our scheduling implements a periodic "heartbeat" task that checks for "outstanding" events and fires them if they are overdue. We are using python-dateutil's rrule for defining this.
We are looking at alternatives to this approach, and Chronos seems a very attractive alternative: 1) it would remove the need for a heartbeat scheduling task, 2) it supports RESTful submission of events in ISO8601 format, 3) it has a useful management interface, and 4) it scales.
The crucial requirement is that scheduling needs to be configurable on the fly from the web interface. This is why we can't use celerybeat's built-in scheduling out of the box.
Are we going to shoot ourselves in the foot by switching over to Chronos?
This SO question has solutions to your dynamic periodic-task problem. The following is not the accepted answer at the moment:
from datetime import datetime

from django.db import models
from djcelery.models import PeriodicTask, IntervalSchedule


class TaskScheduler(models.Model):
    periodic_task = models.ForeignKey(PeriodicTask)

    @staticmethod
    def schedule_every(task_name, period, every, args=None, kwargs=None):
        """Schedules a task by name, to run every "every" "period". An example call:

            TaskScheduler.schedule_every('mycustomtask', 'seconds', 30, [1, 2, 3])

        would schedule your custom task to run every 30 seconds with the
        arguments 1, 2 and 3 passed to the actual task.
        """
        permissible_periods = ['days', 'hours', 'minutes', 'seconds']
        if period not in permissible_periods:
            raise Exception('Invalid period specified')
        # create the periodic task and the interval
        ptask_name = "%s_%s" % (task_name, datetime.now())  # create some name for the periodic task
        interval_schedules = IntervalSchedule.objects.filter(period=period, every=every)
        if interval_schedules:  # if an interval schedule like this already exists, reuse it
            interval_schedule = interval_schedules[0]
        else:  # create a brand new interval schedule
            interval_schedule = IntervalSchedule()
            interval_schedule.every = every  # should check to make sure this is a positive int
            interval_schedule.period = period
            interval_schedule.save()
        ptask = PeriodicTask(name=ptask_name, task=task_name, interval=interval_schedule)
        if args:
            ptask.args = args
        if kwargs:
            ptask.kwargs = kwargs
        ptask.save()
        return TaskScheduler.objects.create(periodic_task=ptask)

    def stop(self):
        """Pauses the task."""
        ptask = self.periodic_task
        ptask.enabled = False
        ptask.save()

    def start(self):
        """Starts the task."""
        ptask = self.periodic_task
        ptask.enabled = True
        ptask.save()

    def terminate(self):
        self.stop()
        ptask = self.periodic_task
        self.delete()
        ptask.delete()
I haven't used djcelery yet, but it supposedly has an admin interface for dynamic periodic tasks.
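A short usage sketch of the model above (names are invented for illustration), tying it back to the on-the-fly requirement: a web view or admin action can call schedule_every directly and hold on to the returned TaskScheduler row.

    # Hedged usage sketch; 'myapp.tasks.send_report' is a made-up task name.
    scheduler = TaskScheduler.schedule_every('myapp.tasks.send_report', 'minutes', 30)
    # ... later, from the web UI code:
    scheduler.stop()       # pause the periodic task
    scheduler.terminate()  # remove it entirely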