We are using Celery Flower to monitor our Celery workers, with RabbitMQ as the message broker. The problem is that about two days after the Flower monitoring service starts, it begins consuming more than 100% of a CPU, and we then get CPU overload alerts. We have not been able to figure out why Flower consumes so much CPU. Please share your suggestions.
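One thing worth checking (this is an assumption, not something confirmed by the report above): Flower keeps a record of task events in memory, and on a busy cluster that state grows over days of uptime, which can drive up CPU. Capping it with Flower's max_tasks option is a cheap experiment; the broker URL below is a placeholder for your own:

    celery flower --broker=amqp://guest:guest@rabbit-host:5672// --max_tasks=10000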
Using its statsd plugin, Airflow can report on the metric executor.queued_tasks, among others.
I am using the CeleryExecutor and need to know how many tasks are waiting in the Celery broker, so I know when new workers should be spawned; I have configured my workers so that they cannot take many tasks concurrently. Is this metric what I need?
No. If you want to know how many task instances (TIs) are waiting in the broker, you'll have to connect to the broker itself.
Task instances that are waiting to get picked up from the Celery broker are queued according to the Airflow DB, but running according to the CeleryExecutor. This is because the CeleryExecutor considers any task instance that was successfully sent to the broker to be running (unlike the DB, which waits for a worker to pick it up before marking it as running).
The metric executor.queued_tasks reports the number of tasks queued according to the executor, not the DB.
The number of queued task instances according to the DB is not exactly what you need either, because it is the number of task instances waiting in the broker plus the number still queued inside the executor. And when would TIs be stuck in the executor's queue, you ask? When Airflow's parallelism setting prevents the executor from sending them to the broker.
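If you do want the broker-side count, a minimal sketch using kombu (which Celery already depends on) is below. It assumes a RabbitMQ broker and Airflow's default Celery queue name, "default"; both the URL and the queue name are placeholders for your own settings:

    from kombu import Connection

    # Placeholder broker URL; use the broker your CeleryExecutor points at.
    with Connection("amqp://guest:guest@localhost:5672//") as conn:
        # "default" is Airflow's default Celery queue name; adjust if you
        # configured a different default_queue.
        queue = conn.SimpleQueue("default")
        # qsize() does a passive queue declare and returns the message count,
        # i.e. the number of task instances waiting for a worker.
        print("TIs waiting in broker:", queue.qsize())
        queue.close()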
We are using Airflow (1.10.3) with the Celery executor (Celery 4.1.1 (latentcall)) and SQS as the broker. While debugging an issue, we tried the Celery CLI and found that the SQS broker is not supported by any of the inspect commands or by monitoring tools such as Flower.
Is there any way we can monitor the tasks or events on celery workers?
We have tried the celery events monitor as follows:
celery events -b sqs://
But it reports that no workers were discovered and no tasks were selected.
The celery inspect command help page shows:
Availability: RabbitMQ (AMQP) and Redis transports.
Please let me know if I am missing something, or whether it is even possible to monitor Celery workers with SQS.
The SQS transport does not provide support for monitoring/inspection (this is the main reason why I do not use it)... According to the latest documentation, Redis and RabbitMQ are the only broker types that support monitoring/inspection and remote control.
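Worker inspection and remote control are off the table with SQS, but you can still watch queue depth by polling the queue's attributes with boto3. A minimal sketch, where the queue URL is a placeholder for the queue your workers consume:

    import boto3

    sqs = boto3.client("sqs")
    # Placeholder queue URL; substitute your Celery queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/celery",
        AttributeNames=[
            "ApproximateNumberOfMessages",            # waiting to be picked up
            "ApproximateNumberOfMessagesNotVisible",  # picked up, not yet acked
        ],
    )
    print(attrs["Attributes"])

This tells you nothing about the workers themselves, only about the backlog, but it is often enough to decide when to scale.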
I have the main producer of tasks in a webserver. I do not want the webserver to consume any tasks; it should only send tasks to the broker, to be consumed by other nodes.
Right now I route tasks by using the -Q option on each node to specify the particular queues that node consumes. Is there a way to specify zero queues for a worker?
Any help appreciated, thanks!
You do not need to run a worker to push tasks to the broker - you can do that from a regular Python process.
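A minimal sketch of such a producer-only process; the app name, broker URL, task name, and queue are all placeholders to be matched to your worker configuration:

    from celery import Celery

    # Same broker the worker nodes use; no worker is started in this process.
    app = Celery("myapp", broker="amqp://guest:guest@localhost:5672//")

    # send_task publishes by name, so the webserver does not even need to
    # import the task code that runs on the worker nodes.
    app.send_task("myapp.tasks.process_upload", args=[42], queue="uploads")

Because the webserver never runs celery worker, it consumes from zero queues by construction; the nodes started with -Q pick the tasks up.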
From the description of Storm, it is based on Zookeeper, and whenever a worker node dies, it can be recovered and gets its state from Zookeeper.
Does anyone know how that is done? Specifically:
how does the failed worker node get recovered?
how does Zookeeper keep that state? AFAIK, each znode can only store a small amount of data.
Are you talking about workers or supervisors? Each Storm worker node runs a "supervisor" daemon which manages the worker processes.
You need to set up supervision (something like daemontools or supervisord, which is unrelated to Storm supervisors) to monitor and restart the nimbus and supervisor daemons in case they fail. Both nimbus and the supervisors are fail-fast and stateless; Zookeeper is used for coordination between nimbus and the supervisors and for holding state, which lives in Zookeeper or on disk so that no state information is lost.
The state data isn't large, and Zookeeper should be run supervised too.
Check this for more fault tolerance details.
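As an illustration, a minimal supervisord sketch for the two daemons (the install path and user are placeholders):

    [program:storm-nimbus]
    command=/opt/storm/bin/storm nimbus
    user=storm
    autostart=true
    autorestart=true

    [program:storm-supervisor]
    command=/opt/storm/bin/storm supervisor
    user=storm
    autostart=true
    autorestart=true

With this in place, a crashed nimbus or supervisor daemon is restarted immediately and, being stateless, re-reads what it needs from Zookeeper and local disk.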
How do I make storm nimbus restart a worker on the same machine?
To test fault tolerance, I kill -9 a worker process, expecting the worker to be restarted on the same machine, but on one of the machines nimbus launches the worker on another machine!
The nimbus log does not show repeated attempts, errors, or anything unusual.
Would appreciate any help, Thanks!
You shouldn't need to. Workers should be able to switch to an open slot on any supervisor. If you have a bolt that cannot accommodate this because it reads data stored on a particular supervisor, that is a design problem.
Additionally, Storm's fault tolerance is intended to handle not only worker failures but also supervisor failures, in which case you won't be able to restart a worker on the same supervisor. You shouldn't need to worry about where a worker runs: that's a feature of Storm.