Task executes slowly after Celery has been running for a long time, but works quickly again after a Celery restart - celery

This kind of issue has occurred several times.
The log shows that once Celery has been serving for a long time, there is a stall of about 50 seconds. I don't know what happens during that "break".
But everything works properly after restarting Celery, and the task itself actually takes less than 10 seconds.
I suppose something in the Celery backend becomes overloaded and is then released by the restart.
But I have no idea what the real problem is, and I have hardly ever come across an issue like this.
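A first step that can help narrow this down is logging per-task timings from inside the worker, to see whether the 50 seconds is spent in the task body or before it even starts. This is only a minimal sketch, assuming you can import a module like this when the worker boots; task_prerun/task_postrun are real Celery signals, but the module itself is hypothetical:

    # timing_probe.py - hypothetical helper, import it where the worker starts
    import logging
    import time

    from celery.signals import task_prerun, task_postrun

    logger = logging.getLogger(__name__)
    _start_times = {}

    @task_prerun.connect
    def record_start(task_id=None, task=None, **kwargs):
        # Fires in the worker right before the task body runs,
        # so broker/queue wait time is already excluded here.
        _start_times[task_id] = time.monotonic()

    @task_postrun.connect
    def record_end(task_id=None, task=None, **kwargs):
        started = _start_times.pop(task_id, None)
        if started is not None:
            logger.info("task %s (%s) ran for %.1fs",
                        task.name, task_id, time.monotonic() - started)

If the 50-second gap shows up before task_prerun fires, the stall is in the broker/worker handoff or the result backend rather than in your task code, which would be consistent with the "something overloaded that a restart clears" theory.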

Related

Airflow example DAGs take a long time in GUI and the scheduler is showing this: DagFileProcessorManager (PID=...) last sent a heartbeat

While running the standard Airflow examples with Airflow 2.1.2, DAGs are taking a long time to complete. This problem occurs on every DAG run when triggering from the Airflow GUI; it isn't a problem when running as a test from the Airflow command line. Watching the scheduler log as it runs, this is what is apparent: after a task runs, the DagFileProcessorManager apparently has to be restarted before it continues to the next tasks, which takes 1 to 2 minutes. The restart happens after an absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use PostgreSQL instead of SQLite.
(2) Switch from SequentialExecutor to LocalExecutor.
Just to add to that: we had other similar reports, and we decided to add a very clear warning for this case in the UI (it will be released in the next version):
https://github.com/apache/airflow/pull/17133
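For reference, the switch described in (1) and (2) is just an airflow.cfg change (or the equivalent environment variables). A minimal sketch, assuming Airflow 2.1.x, where sql_alchemy_conn still lives under [core], and a hypothetical local PostgreSQL database and user both named "airflow":

    [core]
    executor = LocalExecutor
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

After pointing Airflow at the new database you still need to initialise it with "airflow db init" before the scheduler will run cleanly. The underlying reason for the slowness is that SequentialExecutor on SQLite can only run one thing at a time, so the scheduler effectively pauses between tasks.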

What's a good way to deploy and update a long running process?

I have a process running on a server.
FYI, I'm trying to solve the same problem for both nodejs and Python - I don't really think the specific server/language matters, as the question is more about the approach to deployment.
The work the process does might take anywhere from seconds to hours to run.
I'm trying to work out how to deploy updated code for the process.
I don't want to just stop the process in the middle of what it is doing for fear of losing all the work done so far in the long running process.
So what's a good way to get the process to gracefully exit and restart when new code has arrived?
I use systemd for running the nodejs service.
I use Ansible to deploy updates, not that this is really relevant.
I thought that maybe, at the end of each execution, the long running process could check whether some file has been placed on disk as a flag indicating it should exit and restart, but that seems kind of brittle and hacky.
Anyone got any better mechanisms for this sort of thing?
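One common alternative to the file-on-disk flag is to lean on the signal systemd already sends on restart: handle SIGTERM, finish (or checkpoint) the current unit of work, and only then exit, letting Restart=always bring up the freshly deployed code. A minimal sketch in Python, under the assumption that the work is processed in discrete chunks inside a loop (fetch_next_job and process are placeholders for your own code):

    import signal
    import time

    shutdown_requested = False

    def request_shutdown(signum, frame):
        # systemd sends SIGTERM on "systemctl restart"; we only note it
        # and let the current job finish instead of dying mid-work.
        global shutdown_requested
        shutdown_requested = True

    signal.signal(signal.SIGTERM, request_shutdown)

    while not shutdown_requested:
        job = fetch_next_job()   # placeholder: however you pick up work
        if job is None:
            time.sleep(1)        # nothing to do; avoid a busy loop
            continue
        process(job)             # placeholder: the potentially long unit of work

    # Exiting here lets systemd (Restart=always) start the updated code.

The one setting to check is TimeoutStopSec in the unit file: it has to be longer than the longest job you are willing to wait for, otherwise systemd escalates to SIGKILL. For jobs that run for hours, checkpointing progress so that a kill only loses the current chunk is usually the safer bet.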

Will celery handle tasks gracefully if supervisorctl stop/start/restart is used?

Recently, I had to restart some inexplicably idle workers run by supervisord. We are thinking about adding a periodic restart, say once or twice a day.
This could easily be done using supervisorctl, but is there any chance tasks will be lost while the restart occurs?
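For what it's worth, a plain supervisorctl restart sends SIGTERM, which triggers Celery's warm shutdown: the worker stops accepting new tasks and waits for the ones already running to finish, so tasks should not be lost as long as supervisord waits long enough before escalating to SIGKILL. A sketch of the two supervisord settings worth double-checking (the program name and the timeout value are just illustrative):

    [program:celeryworker]
    command=celery -A proj worker --loglevel=INFO
    ; must exceed your longest task, or supervisord escalates to SIGKILL
    stopwaitsecs=600
    ; deliver the signal to the whole process group, not just the parent
    stopasgroup=true

On the Celery side, setting task_acks_late = True means a task that does get killed mid-run is redelivered instead of silently lost, at the cost of requiring tasks to be idempotent.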

Chronos + Mesosphere. How to execute tasks in parallel?

Good day everyone.
I have a single server for Chronos, Mesos, and Zookeeper, and I want to use Chronos as something that will run my scripts daily: some scripts today, some tomorrow, and so on.
The problem is that when I try to launch tasks one right after another, only the first one executes correctly; the other one is lost somewhere. If I launch the first, pause for 3-4 seconds, and then launch the other, both are launched, but sequentially.
And I need to run them in parallel.
Can someone give me a hint on this? Maybe there are some settings I need to change?
You should set a start time in UTC for both tasks, with a repeating period of 24 hours. In that case, there is no reason why your tasks should not execute in parallel. Check the Chronos logs and the task logs in the sandbox on Mesos for errors.
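To make the "24-hour repeating period" concrete: Chronos schedules are ISO 8601 repeating intervals, and two independent jobs whose schedules coincide should simply be launched as two separate Mesos tasks. A sketch of registering two such jobs through the scheduler API with Python (the host, job names, commands, and resource numbers are placeholders):

    import json
    import urllib.request

    CHRONOS = "http://localhost:4400"   # placeholder: your Chronos host/port

    jobs = [
        # "R" = repeat forever, starting at the given UTC time, every 24h ("P1D")
        {"name": "daily-report", "command": "/opt/scripts/report.sh",
         "schedule": "R/2015-06-01T10:00:00Z/P1D", "owner": "ops@example.com",
         "cpus": 0.5, "mem": 256},
        {"name": "daily-cleanup", "command": "/opt/scripts/cleanup.sh",
         "schedule": "R/2015-06-01T10:00:00Z/P1D", "owner": "ops@example.com",
         "cpus": 0.5, "mem": 256},
    ]

    for job in jobs:
        req = urllib.request.Request(
            CHRONOS + "/scheduler/iso8601",   # Chronos endpoint for ISO 8601 jobs
            data=json.dumps(job).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)

If the second task still goes LOST, one thing worth checking on a single box is whether the lone Mesos slave has enough free CPU and memory to hold both tasks at the same time; the slave log will show the resource offers it is making.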
You can certainly run all of these components (Chronos, master, slave, and ZK) on the same machine, although ZK really becomes valuable once you have HA with multiple masters.
As user4103259 suggested, check the master and slave logs for that LOST/failed taskId to see what exactly happened to it. A task could go LOST/failed for numerous reasons, anywhere along the task launch/running/completing process.

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Since I started queueing over 1,000 jobs at once, Gearman hasn't been working properly...
The problem is that when I submit the jobs in background mode, I can see from the monitoring page (Gearman monitor) that the jobs are queued correctly,
but the queue is drained right afterwards (within a few seconds) without the jobs being delivered to a worker.
In the end, the jobs are never executed by the worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely and reinstalling Gearman as well as the PHP library. (I'm using one CentOS and one Ubuntu machine with the PHP gearman library; the versions are 0.34 and 1.0.2.)
But no luck yet... the job server just keeps misbehaving as I explained above.
What should I do for now?
Can I check the workers' state, or see and monitor the whole process from queueing the jobs to delivering them to the worker?
When I try gearmand with an option like 'gearmand -vvvv', it never prints anything on the screen while I register a worker with the server and run a job with client code (PHP).
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.
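On the monitoring question: gearmand speaks a plain-text administrative protocol on its listening port (4730 by default), and the 'status' and 'workers' commands show, per function, how many jobs are queued, how many are running, and how many workers are connected. A small sketch in Python (host and port are the defaults; adjust as needed):

    import socket

    def gearman_admin(command, host="127.0.0.1", port=4730):
        """Send one admin command ('status' or 'workers') and return the reply."""
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(command.encode("ascii") + b"\n")
            reply = b""
            # Each admin response is terminated by a line containing only "."
            while not reply.endswith(b".\n"):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                reply += chunk
            return reply.decode("ascii")

    # 'status' columns: function name, queued jobs, running jobs, available workers
    print(gearman_admin("status"))
    print(gearman_admin("workers"))

If 'status' shows the queued count draining while 'workers' lists nothing registered for that function name, that at least narrows the problem to the job server rather than the workers. Note also that without a persistent queue behind gearmand, anything still queued is lost whenever the daemon itself restarts.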