How to debug Celery delays and errors?

I am taking over a Django project from someone who uses Celery along with Mandrill. There are daily reports that are sent to customers, and for some reason not a single mail was sent for three days; the reports accumulated and were all sent together after three days. Since I am new to Celery, I want to know how to debug Celery delays and errors. What are the popular commands and execution paths to follow?

Short tips:
Set debug=True in your Celery config; it will give you the registration and execution time of every task.
Install Flower, a popular tool for monitoring Celery tasks.
Use Sentry for handy error tracking and aggregation.
Happy debugging ;)
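If you need per-task timing beyond that, Celery's task signals let you measure it yourself. A minimal sketch (the in-memory dict is just for illustration; put this in a module your workers import):

import time
from celery.signals import task_prerun, task_postrun

_started = {}

@task_prerun.connect
def record_start(task_id=None, task=None, **kwargs):
    # remember when the worker actually began executing this task
    _started[task_id] = time.monotonic()

@task_postrun.connect
def report_runtime(task_id=None, task=None, **kwargs):
    started = _started.pop(task_id, None)
    if started is not None:
        print("%s took %.2fs" % (task.name, time.monotonic() - started))

A long gap before task_prerun ever fires is usually the "delay" part: the task sat in the broker because no worker picked it up.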

Related

Monitoring SQS+Celery

Is there some Flower analog that works with SQS + Celery? I found out that in the current Celery version SQS doesn't support Celery events. Maybe there is some other way to get information about how many tasks succeeded, how many crashed, etc.
My project uses Django + Celery + SQS.
CloudWatch can give you some monitoring information:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/sqs-metricscollected.html
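For example, a minimal boto3 sketch that pulls one of those SQS metrics from CloudWatch (the region and the queue name "celery" are placeholders for your own):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "celery"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])

Note this only tells you about queue traffic; success/failure counts still have to come from your own logging, since SQS doesn't carry Celery events.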

How to configure Celery to send email alerts when tasks fail?

How is it possible to configure Celery to send email alerts when tasks are failing?
For example, I want Celery to notify me when more than 3 tasks fail or more than 10 tasks are being retried.
Is this possible using Celery or a utility (e.g. Flower), or do I have to write my own plugin?
Yes, all you need to do is set CELERY_SEND_TASK_ERROR_EMAILS = True: if a Celery task fails, Django will send a message with the traceback to all of the addresses listed in the ADMINS setting.
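For reference, a minimal sketch of the relevant Django settings (the addresses are placeholders; this is old-style Celery config, and task error emails were removed in Celery 4.0, so it only applies up to 3.x):

# settings.py (Celery 3.x style; removed in Celery 4.0)
CELERY_SEND_TASK_ERROR_EMAILS = True

ADMINS = [
    ("Ops", "ops@example.com"),  # placeholder address
]

SERVER_EMAIL = "celery@example.com"  # From: address used for the error mails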
As far as I know, it's not possible out of the box.
You could write a custom client on top of Celery or Flower, or access RabbitMQ directly.
What I would do (and am doing) is simply log failed tasks and then use something like Graylog2 to monitor the log files; this works for all of your infrastructure, not just Celery.
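A minimal sketch of that logging approach, using Celery's task_failure signal (the logger name is arbitrary):

import logging
from celery.signals import task_failure

logger = logging.getLogger("celery.failures")

@task_failure.connect
def log_failure(sender=None, task_id=None, exception=None, einfo=None, **kwargs):
    # one line per failed task; the log shipper (Graylog2, etc.) picks these up
    logger.error("task %s (%s) failed: %r",
                 sender.name if sender else "?", task_id, exception)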
You can also use something like New Relic, which monitors your processes directly and offers many other features, although email reporting on exceptions is somewhat limited in New Relic.
A simple client/monitor probably is the quickest solution.

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Ever since I started queuing over 1,000 jobs at once, Gearman hasn't been working properly.
The problem is that when I submit the jobs in background mode, I can see them correctly queued on the monitoring page (Gearman monitor),
but they are drained right afterwards (within a few seconds) without ever being delivered to a worker.
In the end, the jobs are never executed by a worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely and reinstalling Gearman as well as the PHP library. (I'm using one CentOS and one Ubuntu machine with the PHP Gearman library, versions 0.34 and 1.0.2.)
But no luck yet; the job server just misbehaves as I explained above.
What should I do for now?
Can I check the workers' state, or see and monitor the whole process from queuing the jobs to delivering them to a worker?
When I try gearmand with an option like gearmand -vvvv, it never prints anything on the screen while I register a worker with the server and run a job with the client code (PHP).
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.
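On the monitoring question: gearmand exposes a text-based admin protocol on its normal port (4730 by default), where "status" lists queued/running job counts per function and "workers" lists connected workers. A minimal Python sketch of querying it:

import socket

def gearman_admin(command, host="localhost", port=4730):
    # admin responses are terminated by a line containing a single "."
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(command.encode() + b"\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()

print(gearman_admin("status"))   # function, total, running, available workers
print(gearman_admin("workers"))  # fd, ip, client id, registered functions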

How do I schedule one-time tasks from a Perl CGI application?

I am writing an application to allow users to schedule one-time long-running tasks from a web application (Linux/Apache/CGI::Application). To do this I use the Schedule::At module which is the Perl interface to the "at" command. Since the scheduled tasks are not repeating, I am not considering "cron". I have two issues with "at" though:
Scheduling works fine when my CGI application runs under the suexec wrapper, but not when scheduled by the owner of the Apache process. How can I get scheduling to work in both environments (suexec and no-suexec)?
It appears that the processes scheduled by "at" or Schedule::At have no failure reporting, and I sometimes find that scheduled tasks fail silently. Is there some way to log the fact that the scheduled task (not the scheduler itself) has failed to run?
I am not fixed on "at" and am open to using other, more robust, scheduling methods if there are any.
Thank you for your attention.
I've heard good things about The Schwartz. It doesn't have a delay-until feature though; you'd submit the jobs via at, but that should solve both of the problems you list above, as long as your submit_job script is simple.
(As a caveat, I've only used Gearman. I think you'd want a reliable job queue for this, a "fire and forget" mechanism, so you can keep your submit_job script dumb.)
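On the silent-failure point (issue 2), another low-tech option is to have "at" run a small wrapper that records the task's exit status instead of running the task directly. A sketch (in Python for brevity, though the same idea works in Perl; the log path and task command are placeholders):

#!/usr/bin/env python
import logging
import subprocess

logging.basicConfig(filename="/var/log/scheduled_tasks.log",
                    level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

cmd = ["/usr/local/bin/long_running_task"]  # placeholder for the real task
try:
    subprocess.check_call(cmd)
    logging.info("task %s completed", cmd[0])
except subprocess.CalledProcessError as exc:
    logging.error("task %s failed with exit code %d", cmd[0], exc.returncode)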

Queuing systems - what is a good way to start up multiple workers?

How have you set up one or more worker scripts for queue-oriented systems?
How do you arrange to start up, and restart if necessary, worker scripts as required? (I'm thinking of tools such as init.d, the Ruby-based 'god', DJB's daemontools, etc.)
I'm developing an asynchronous queue/worker system, in this case using PHP & Beanstalkd (though the actual language and daemon aren't important). The tasks themselves are not too hard: encode an array of commands and parameters into JSON for transport through the Beanstalkd daemon, then pick them up in a worker script and action them as required.
There are a number of other similar queue/worker setups out there, such as Starling, Gearman, Amazon's SQS, and more 'enterprise'-oriented systems like IBM's MQ and RabbitMQ. If you run something like Gearman or SQS, how do you start and control the worker pool? The question is about the initial worker startup, and then being able to add extra workers and shut them down at will (though I can send a message through the queue to shut them down, as long as some 'watcher' won't automatically restart them). This is not a PHP problem; it's about straight Unix processes: setting up one or more processes to run on startup, or adding more workers to the pool.
A bash script to loop a script is already in place; it calls the PHP script, which then collects and runs tasks from the queue, occasionally exiting so it can clean itself up (it can also pause for a few seconds on failure, or via a planned event). This works fine, and building the worker processes on top of it won't be hard at all.
Getting a good worker-controller system is about flexibility: starting one or two workers automatically when the machine boots, being able to add a couple more from the command line when the queue is busy, and shutting down the extras when they are no longer required.
I've been helping a friend who's working on a project that involves a Gearman-based queue that will dispatch various asynchronous jobs to various PHP and C daemons on a pool of several servers.
The workers have been designed to behave just like classic Unix/Linux daemons, thanks to simple shell scripts in /etc/init.d/ and commands like:
invoke-rc.d myWorker start|stop|restart|reload
This mechanism is simple and efficient. And since it relies on standard Linux features, even people with limited knowledge of your app can launch a daemon or stop one, as long as they know what it's called system-wise ("myWorker" in the above example).
Another advantage of this mechanism is that it makes managing your worker pool easy. You could have 10 daemons on a machine (myWorker1, myWorker2, ...) and have a "worker manager" start or stop them depending on the queue length. And since these commands can be run over ssh, you can easily manage several servers.
This solution may sound cheap, but if you build it with well-coded daemons and reliable management scripts, I don't see why it would be less effective than big-bucks solutions for any average (as in "non-critical") project.
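As an illustration of that "worker manager" idea, a rough Python sketch (the service names, the one-worker-per-100-jobs threshold, and the beanstalkd tube are placeholders; it assumes the init.d scheme above):

import socket
import subprocess

def ready_jobs(tube="default", host="localhost", port=11300):
    # beanstalkd's text protocol: "stats-tube <tube>" returns YAML-style stats
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(("stats-tube %s\r\n" % tube).encode())
        stats = sock.recv(4096).decode()
    for line in stats.splitlines():
        if line.startswith("current-jobs-ready:"):
            return int(line.split(":")[1])
    return 0

workers = ["myWorker1", "myWorker2", "myWorker3"]
wanted = min(len(workers), max(1, ready_jobs() // 100))  # scale with backlog
for name in workers[:wanted]:
    subprocess.call(["invoke-rc.d", name, "start"])
for name in workers[wanted:]:
    subprocess.call(["invoke-rc.d", name, "stop"])

Run something like this from cron every minute or so and the pool roughly tracks the queue depth.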
Real message-queuing middleware like WebSphere MQ or MSMQ offers "triggers", where a service that is part of the queue manager starts a worker when new messages are placed into a queue.
AFAIK, no "web service" queuing system can do that, by the nature of the beast. However I have only looked hard at SQS. There you have to poll the queue, and in Amazon's case overly eager polling is going to cost you some real $$.
I've recently been working on such a tool. It's not entirely finished (though it shouldn't take more than a few more days before I hit something I could call 1.0) and it's clearly not ready for production yet, but the important parts are already coded. Anybody can have a look at the code here: https://gitorious.org/workers_pool.
Supervisor is a good process-monitoring tool. It includes a web UI where you can monitor and manage workers.
Here is a simple config file for a worker:
[program:demo]
command=php worker.php ; php command to run worker file
numprocs=2 ; number of processes
process_name=%(program_name)s_%(process_num)03d ; unique name for each process if numprocs > 1
directory=/var/www/demo/ ; directory containing worker file
stdout_logfile=/var/www/demo/worker.log ; log file location
autostart=true ; auto start program when supervisor starts
autorestart=true ; auto restart program if it exits
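Once that file is in Supervisor's config directory, supervisorctl reread followed by supervisorctl update picks the program up, and supervisorctl status or supervisorctl stop demo:* lets you inspect and control the workers (with numprocs > 1 the processes are addressed as the demo:* group).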