I have started a dramatiq worker to do some tasks and, after a point, it just gets stuck and throws the error mentioned below after some time.
[MainThread] [dramatiq.MainProcess] [CRITICAL] Worker with PID 53 exited unexpectedly (code -9). Shutting down...
What could be the potential reason for this to occur? Are system resources a constraint?
The queuing task runs inside a Kubernetes pod.
Please check the kernel logs (/var/log/kern.log and /var/log/kern.log.1).
The worker might be getting killed by the OOM killer (out of memory).
To resolve this, try increasing the memory available to the worker if you are running it in a Docker container or a Kubernetes pod.
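If the OOM killer is the cause, kubectl describe pod <pod-name> will usually show the previous container state as terminated with reason OOMKilled, and the kernel log will contain an oom-kill entry around the same time. Below is a minimal sketch of raising the memory request and limit on the worker container; the Deployment name, image, and sizes are placeholders for illustration, not values from the question:

# hypothetical worker Deployment excerpt; names and sizes are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dramatiq-worker
spec:
  template:
    spec:
      containers:
        - name: worker
          image: registry.example.com/dramatiq-worker:latest
          resources:
            requests:
              memory: "512Mi"   # what the scheduler reserves for this container
            limits:
              memory: "2Gi"     # exceeding this gets the container OOM-killed

Exit code -9 in the dramatiq log matches a process killed by SIGKILL, which is the signal the OOM killer sends.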
Related
I am running an Airflow cluster on EKS on AWS. I have set up some scaling config for the workers: if CPU/memory usage goes above 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, two things happen:
If no task is running on a worker pod, it terminates within 40 seconds.
If any task is running on a worker pod, it terminates in about 8 minutes, and after one more minute I find the task failing in the UI.
I have set the below two properties in the Helm chart for worker pod termination.
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true
  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180
## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
From this config, I would expect the worker pod to shut down after 5 minutes at most, whether a task is running or not, so I am not sure why I see a total of 8 minutes for worker pod termination. My main issue, though, is whether there is any way to set up the config so that a worker pod only terminates once the task running on it finishes execution. Since tasks in my DAGs can run anywhere from a few minutes to a few hours, I don't want to put a large value in gracefullTerminationPeriod. I would appreciate any solution around this.
Some more info: the long-running task is generally a PythonOperator that runs either a Presto SQL query or a Databricks job, via PrestoHook or DatabricksOperator respectively, and I don't want these to receive SIGTERM before they complete their execution when the worker pod scales down.
This is not possible due to limitations on the Kubernetes side; more details are available here. However, using a large value for gracefullTerminationPeriod works, and although that is not what I originally intended, it works better than I thought: with a large gracefullTerminationPeriod, workers do not actually wait around for the whole period before terminating. Once a worker pod is marked for termination, it terminates as soon as the number of tasks running on it reaches zero.
Until Kubernetes accepts the proposed changes and a new community Helm chart is released, I think this is the best solution that does not incur the cost of keeping workers up.
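For reference, a sketch of the values the answer above describes, using the same keys as the chart excerpt in the question; the 21600-second value is only a placeholder, pick something comfortably above your longest-running task:

celery:
  gracefullTermination: true
  # large enough for any running task to finish; the pod still exits
  # as soon as the number of tasks running on it reaches zero
  gracefullTerminationPeriod: 21600
terminationPeriod: 120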
I am creating a development pod in Lens, in which a specific program is launched. This program prints logs and emits a heartbeat every 10 minutes. From time to time the heartbeat just stops, without any exception, and the program stops working, but the pod does not restart; it keeps running as if nothing happened. Has anyone faced this problem?
I need help with a long-running DAG that keeps failing after an hour while the task is still in running mode.
I have been using Airflow for the past 6-8 months. With the help of our infrastructure team, I have set up Airflow in our company. It runs on an AWS ECS cluster, the DAGs sit on an EFS instance with throughput set to provisioned, and the logs are written to an S3 bucket.
For the worker ECS service we have an autoscaling policy that scales the cluster up at 1 AM at night and scales it down at 4 AM.
It runs fine for short-duration jobs. It was also successful with a long-duration job that wrote its results into a Redshift table intermittently.
But now I have a job that loops over a pandas DataFrame and updates two dictionaries.
Issue:
The job takes about 4 hours to finish, but at around the 1-hour mark it automatically fails without any error, while the task stays in running mode until I manually stop it. When I try to look at the logs, the actual log doesn't come up; instead it shows:
[2021-05-04 19:59:18,785] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: app-doctor-utilisation.execute 2021-05-04T18:57:10.480384+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2021-05-04 19:59:18,786] {local_task_job.py:90} INFO - Task is not able to be run
Now when I stop the task, I can see some of the logs, with the following lines at the end:
[2021-05-04 20:11:11,785] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 38
[2021-05-04 20:11:11,787] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-05-04 20:11:11,959] {helpers.py:291} INFO - Process psutil.Process(pid=38, status='terminated', exitcode=0, started='18:59:13') (38) terminated with exit code 0
[2021-05-04 20:11:11,960] {local_task_job.py:102} INFO - Task exited with return code 0
Can someone please help me figure out the issue and whether there is any solution for this?
How does Kubernetes enforce the memory limit? Is it via cgroups, or is there an actual process watching container processes and terminating them?
I seem to have a container process that gets a SIGKILL, but the pod does not get restarted (though the process does die from the SIGKILL), so I'm unsure what the cause is.
https://github.com/kubernetes/kubernetes/issues/50632
IIUC, the process which consumes the most memory will be oom-killed in this case. The container won't terminate unless the killed process is the main process within the container.
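To the original question: for per-container limits, the limit from the pod spec is written into the container's memory cgroup by the kubelet/container runtime, and the kernel's OOM killer enforces it when the cgroup exceeds its limit (node-level memory pressure is handled separately by kubelet eviction). A minimal sketch of a pod with such a limit, with a placeholder name and image:

apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-example   # placeholder name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      resources:
        limits:
          memory: "256Mi"   # becomes the cgroup memory limit for this container

If the process that gets OOM-killed is not the container's main process, the container keeps running and the pod is not restarted, which matches the behaviour described in the quote above.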
When you start a process using supervisord it is in the "STARTING" status; if it then runs into trouble and autorestart is set to true, it goes into the "BACKOFF" status.
I don't want to wait for the startretries to be attempted; I want to stop the restarting process manually using supervisorctl. The only way I have found to do so is to stop the entire supervisord service and start it again (every process goes into the "STOPPED" status if there is no autostart).
Is there a better way to do this (forcing the "STOPPED" status from the "BACKOFF" status), since I have other processes managed by supervisord that I don't want to stop?
If I try to stop it with
supervisorctl stop process
I get
FAILED: attempted to kill process with sig SIGTERM but it wasn't running
If I try to start it with
supervisorctl start process
I get
process: ERROR (already started)
Of course I could disable autorestart, but it can be useful. A workaround is to limit startretries, but is there a better solution?
Hey, this may help you:
When an autorestarting process is in the BACKOFF state, it will be automatically restarted by supervisord. It will switch between STARTING and BACKOFF states until it becomes evident that it cannot be started because the number of startretries has exceeded the maximum, at which point it will transition to the FATAL state. Each start retry will take progressively more time.
So you don't need to stop the BACKOFF process manually. If you don't want to wait too long, it is better to set startretries to a small number.
see more info here: http://supervisord.org/subprocess.html
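If the goal is simply to shorten how long a failing process cycles through STARTING/BACKOFF before supervisord gives up, a sketch of the relevant program section follows; the program name and command are placeholders:

[program:myworker]
; "myworker" and its command are placeholders for this sketch
command=/usr/local/bin/myworker
autostart=true
autorestart=true
; how long the process must stay up before it is considered successfully started
startsecs=10
; fewer start retries before supervisord gives up and marks the process FATAL (default is 3)
startretries=1

Once the process reaches FATAL, it stays down until you explicitly run supervisorctl start myworker.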
Good luck!
Use the following command to force supervisor to stop a process in the BACKOFF state.
supervisorctl stop <gname>:*