Dotcloud supervisord shows error but process is running - dotcloud

My dotcloud setup (django-celery with rabbitmq) was working fine a week ago - the processes were starting up ok and the logs were clean. However, I recently repushed (without updating any of the code), and now the logs are saying that the processes fail to start even though they seem to be running.
Supervisord log
dotcloud#hack-default-www-0:/var/log/supervisor$ more supervisord.log
2012-06-03 10:51:51,836 CRIT Set uid to user 1000
2012-06-03 10:51:51,836 WARN Included extra file "/etc/supervisor/conf.d/uwsgi.conf" during parsing
2012-06-03 10:51:51,836 WARN Included extra file "/home/dotcloud/current/supervisord.conf" during parsing
2012-06-03 10:51:51,938 INFO RPC interface 'supervisor' initialized
2012-06-03 10:51:51,938 WARN cElementTree not installed, using slower XML parser for XML-RPC
2012-06-03 10:51:51,938 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2012-06-03 10:51:51,946 INFO daemonizing the supervisord process
2012-06-03 10:51:51,947 INFO supervisord started with pid 144
2012-06-03 10:51:53,128 INFO spawned: 'celerycam' with pid 159
2012-06-03 10:51:53,133 INFO spawned: 'apnsd' with pid 161
2012-06-03 10:51:53,148 INFO spawned: 'djcelery' with pid 164
2012-06-03 10:51:53,168 INFO spawned: 'uwsgi' with pid 167
2012-06-03 10:51:53,245 INFO exited: djcelery (exit status 1; not expected)
2012-06-03 10:51:53,247 INFO exited: celerycam (exit status 1; not expected)
2012-06-03 10:51:54,698 INFO spawned: 'celerycam' with pid 176
2012-06-03 10:51:54,698 INFO success: apnsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2012-06-03 10:51:54,705 INFO spawned: 'djcelery' with pid 177
2012-06-03 10:51:54,706 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2012-06-03 10:51:54,731 INFO exited: djcelery (exit status 1; not expected)
2012-06-03 10:51:54,754 INFO exited: celerycam (exit status 1; not expected)
2012-06-03 10:51:56,760 INFO spawned: 'celerycam' with pid 178
2012-06-03 10:51:56,765 INFO spawned: 'djcelery' with pid 179
2012-06-03 10:51:56,790 INFO exited: celerycam (exit status 1; not expected)
2012-06-03 10:51:56,791 INFO exited: djcelery (exit status 1; not expected)
2012-06-03 10:51:59,798 INFO spawned: 'celerycam' with pid 180
2012-06-03 10:52:00,538 INFO spawned: 'djcelery' with pid 181
2012-06-03 10:52:00,565 INFO exited: celerycam (exit status 1; not expected)
2012-06-03 10:52:00,571 INFO gave up: celerycam entered FATAL state, too many start retries too quickly
2012-06-03 10:52:00,573 INFO exited: djcelery (exit status 1; not expected)
2012-06-03 10:52:01,575 INFO gave up: djcelery entered FATAL state, too many start retries too quickly
dotcloud#hack-default-www-0:/var/log/supervisor$
The djcelery error log:
dotcloud#hack-default-www-0:/var/log/supervisor$ more djcelery_error.log
Traceback (most recent call last):
  File "hack/manage.py", line 2, in <module>
    from django.core.management import execute_manager
ImportError: No module named django.core.management
Traceback (most recent call last):
  File "hack/manage.py", line 2, in <module>
    from django.core.management import execute_manager
ImportError: No module named django.core.management
Traceback (most recent call last):
  File "hack/manage.py", line 2, in <module>
    from django.core.management import execute_manager
ImportError: No module named django.core.management
Traceback (most recent call last):
  File "hack/manage.py", line 2, in <module>
    from django.core.management import execute_manager
ImportError: No module named django.core.management
dotcloud#hack-default-www-0:/var/log/supervisor$
However, supervisorctl status shows that the processes are running, although the pids differ from the ones in the supervisord log. The celery functionality also seems to be working fine: messages are processed, and I can see them being recorded in the Django admin interface (so celerycam is running).
# supervisorctl status
apnsd RUNNING pid 225, uptime 0:00:44
celerycam RUNNING pid 224, uptime 0:00:44
djcelery RUNNING pid 226, uptime 0:00:44
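For reference, one way to cross-check supervisor's view against what is actually running is to compare the reported pids with ps output. This is only a rough sketch; it assumes the celery processes show up in ps under names containing "celery":
# what supervisord believes is running, and under which pids
supervisorctl status
# what is actually running; compare these pids with the ones above
ps aux | grep celery | grep -v grep
# check a specific pid reported by supervisorctl, e.g. 226 for djcelery
ps -p 226 -o pid,cmd || echo "pid 226 is not running"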
Supervisord.conf file:
[program:djcelery]
directory = /home/dotcloud/current/
command = python hack/manage.py celeryd -E -l info -c 2
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
[program:celerycam]
directory = /home/dotcloud/current/
command = python hack/manage.py celerycam
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
http://jefurii.cafejosti.net/blog/2011/01/26/celery-in-virtualenv-with-supervisord/ suggests that the problem may be that the wrong python binary is being used, so I've explicitly specified the python in the supervisord file. It now works, but that doesn't explain what I'm seeing above, or why I've had to change my configuration when it was working fine last week.
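For what it's worth, a quick way to see which interpreter a bare python resolves to inside the service, and whether that interpreter can import Django at all. The /home/dotcloud/env/bin/python path is the virtualenv interpreter that appears later in this question's accepted configuration; adjust it if your layout differs:
which python
python -c "import django; print(django.get_version())"
/home/dotcloud/env/bin/python -c "import django; print(django.get_version())"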
Also, not all of the pids are lining up:
2012-06-03 11:19:03,045 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2012-06-03 11:19:03,051 INFO daemonizing the supervisord process
2012-06-03 11:19:03,052 INFO supervisord started with pid 144
2012-06-03 11:19:04,061 INFO spawned: 'celerycam' with pid 151
2012-06-03 11:19:04,066 INFO spawned: 'apnsd' with pid 153
2012-06-03 11:19:04,085 INFO spawned: 'djcelery' with pid 155
2012-06-03 11:19:04,104 INFO spawned: 'uwsgi' with pid 156
2012-06-03 11:19:05,271 INFO success: celerycam entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2012-06-03 11:19:05,271 INFO success: apnsd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2012-06-03 11:19:05,271 INFO success: djcelery entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2012-06-03 11:19:05,271 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
The supervisorctl status shows that the celerycam pid doesn't line up with the log (150 in the status vs 151 in the log):
# supervisorctl status
apnsd RUNNING pid 153, uptime 0:06:17
celerycam RUNNING pid 150, uptime 0:06:17
djcelery RUNNING pid 155, uptime 0:06:17

My first guess is that you're using the wrong python binary (the system python instead of the virtualenv python), and that's what causes the error below, because the system python doesn't have Django installed.
Traceback (most recent call last):
  File "hack/manage.py", line 2, in <module>
    from django.core.management import execute_manager
ImportError: No module named django.core.management
You should change your supervisord.conf to the following to make sure you are pointing to the correct python version.
[program:djcelery]
directory = /home/dotcloud/current/
command = /home/dotcloud/env/bin/python hack/manage.py celeryd -E -l info -c 2
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
[program:celerycam]
directory = /home/dotcloud/current/
command = /home/dotcloud/env/bin/python hack/manage.py celerycam
stderr_logfile = /var/log/supervisor/%(program_name)s_error.log
stdout_logfile = /var/log/supervisor/%(program_name)s.log
The python path went from python to /home/dotcloud/env/bin/python.
I'm not sure why supervisor says it is running when it is not, but hopefully this one little change will clear up your errors and get everything back to working again.
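If supervisorctl keeps reporting stale or mismatched pids after a config change, one hedged guess is that the edited config was never reloaded, or that more than one supervisord daemon is running. The standard supervisor commands below can help rule that out (nothing here is dotCloud-specific):
# pick up the edited config and restart the affected programs
supervisorctl reread
supervisorctl update
supervisorctl restart djcelery celerycam
# make sure only one supervisord daemon is alive
ps aux | grep supervisord | grep -v grep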

Related

How to start Cloud SQL proxy with supervisor

I tried to start the Cloud SQL proxy under supervisor, but I have no idea what is wrong with it. The documentation doesn't give any clues about this issue. Any ideas would be much appreciated.
I tried the setup on a clean Ubuntu 16 machine, installed supervisor and downloaded cloud_sql_proxy. I put the files under /root and run everything as root for debugging.
Here is my current setup:
/etc/supervisord.conf
[unix_http_server]
file=/tmp/supervisor.sock ; the path to the socket file
chmod=0766 ; socket file mode (default 0700)
[supervisord]
logfile=/tmp/supervisord.log ; main log file; default $CWD/supervisord.log
logfile_maxbytes=50MB ; max main logfile bytes b4 rotation; default 50MB
logfile_backups=10 ; # of main logfile backups; 0 means none, default 10
loglevel=info ; log level; default info; others: debug,warn,trace
pidfile=/tmp/supervisord.pid ; supervisord pidfile; default supervisord.pid
nodaemon=false ; start in foreground if true; default false
minfds=1024 ; min. avail startup file descriptors; default 1024
minprocs=200 ; min. avail process descriptors;default 200
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL for a unix socket
[include]
files = /etc/supervisor/conf.d/*.conf
/etc/supervisor/conf.d/cloud_sql_proxy.conf
[program:cloud_sql_proxy]
command=/root/cloud_sql_proxy -dir=/cloudsql -instances="project_id:us-central1:instance-name" -credential_file="/root/service-account.json"
autostart=true
autorestart=true
startretries=1
startsecs=8
stdout_logfile=/var/log/cloud_sql_proxy-stdout.log
stderr_logfile=/var/log/cloud_sql_proxy-stderr.log
I got the following error after inspecting /tmp/supervisord.log:
2018-10-14 15:49:49,984 INFO spawned: 'cloud_sql_proxy' with pid 3569
2018-10-14 15:49:49,989 INFO exited: cloud_sql_proxy (exit status 0; not expected)
2018-10-14 15:49:50,991 INFO spawned: 'cloud_sql_proxy' with pid 3574
2018-10-14 15:49:50,996 INFO exited: cloud_sql_proxy (exit status 0; not expected)
2018-10-14 15:49:51,998 INFO gave up: cloud_sql_proxy entered FATAL state, too many start retries too quickly
2018-10-14 15:51:46,981 INFO spawned: 'cloud_sql_proxy' with pid 3591
2018-10-14 15:51:46,986 INFO exited: cloud_sql_proxy (exit status 0; not expected)
2018-10-14 15:51:47,989 INFO spawned: 'cloud_sql_proxy' with pid 3596
2018-10-14 15:51:47,998 INFO exited: cloud_sql_proxy (exit status 0; not expected)
2018-10-14 15:51:47,999 INFO gave up: cloud_sql_proxy entered FATAL state, too many start retries too quickly
Finally I managed to figure out a working solution; here it is:
Create a new file /root/start_cloud_sql_proxy.sh:
#!/bin/bash
/root/cloud_sql_proxy -dir=/cloudsql -instances="project_id:us-central1:instance-name" -credential_file="/root/service-account.json"
Under /etc/supervisor/conf.d/cloud_sql_proxy.conf, change the command to execute a bash file:
command=/root/start_cloud_sql_proxy.sh
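One detail the solution above doesn't spell out, and which is an assumption on my part rather than something stated in it: the wrapper script must be executable, and supervisor has to re-read the changed config before the new command takes effect:
chmod +x /root/start_cloud_sql_proxy.sh
supervisorctl reread
supervisorctl update
supervisorctl restart cloud_sql_proxy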

Airflow: Celery task failure

I have airflow up and running, but my task is failing in celery.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
subprocess.check_call(command, shell=True)
File "/usr/local/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run airflow_tutorial_v01 print_hello 2017-06-01T15:00:00 --local -sd /usr/local/airflow/dags/hello_world.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
It is a very basic DAG (taken from the hello world tutorial: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py).
Also, I do not see any logs from my worker; I got this stack trace from the Flower web interface.
If I manually run the airflow run command mentioned in the stack trace on the worker node, it works.
How can I get more information to debug further?
The only log I get when starting `airflow worker` is:
root#ip-10-0-4-85:~# /usr/local/lib/python3.5/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
[2018-07-25 17:49:43,430] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
[2018-07-25 17:49:43,469] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
[2018-07-25 17:49:43,594] {__init__.py:45} INFO - Using executor CeleryExecutor
Starting flask
[2018-07-25 17:49:43,665] {_internal.py:88} INFO - * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit)
^C
The config I use is the default one with a postgresql and redis backend for celery.
I see the worker online in Flower.
Thanks.
edit: edited for more information
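For anyone debugging the same thing: with a default Airflow 1.x layout (an assumption about this setup), the worker writes per-task logs under $AIRFLOW_HOME/logs, and running the task in-process bypasses Celery entirely, which usually surfaces the real error:
# per-task log files written by the worker (default 1.x layout)
ls $AIRFLOW_HOME/logs/airflow_tutorial_v01/print_hello/
# run the task locally without Celery to see the underlying failure
airflow test airflow_tutorial_v01 print_hello 2017-06-01T15:00:00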

Worker Lost Error after task accepted with exitcode 2

I am new to celery. I have added a backup script to celery using periodic_task. In the logs I see "Task accepted: main", and immediately afterwards I see the error below:
[2017-09-21 06:01:00,257: ERROR/MainProcess] Process 'PoolWorker-5' pid:XXXX exited with 'exitcode 2'
[2017-09-21 06:01:00,268: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 2.',)
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 2.
Thanks in advance.

Multiple celerycam processes on the same server

I have two celerycam processes configured to run under supervisor. Here is part of my supervisord.conf:
[program:dev1_celerycam]
directory = /var/www/dev1.example.com
command = /usr/bin/python2.7 /var/www/dev1.example.com/manage.py celerycam --logfile=/var/log/supervisor/dev1_celerycam.log --workdir=/var/www/dev1.example.com
stderr_logfile = /var/log/supervisor/dev1_celerycam_error.log
stdout_logfile = /var/log/supervisor/dev1_celerycam.log
exitcodes=0,2
priority=993
[program:dev_celerycam]
directory = /var/www/dev.example.com
command = /usr/bin/python2.7 /var/www/dev.example.com/manage.py celerycam --logfile=/var/log/supervisor/dev_celerycam.log --workdir=/var/www/dev.example.com
stderr_logfile = /var/log/supervisor/dev_celerycam_error.log
stdout_logfile = /var/log/supervisor/dev_celerycam.log
exitcodes=0,2
priority=995
I also have two celeryd processes in supervisord.conf, and they start perfectly fine on the same server. But for one of the celerycam processes I get the following in supervisord.log:
2013-09-01 09:35:12,546 INFO exited: dev_celerycam (exit status 1; not expected)
2013-09-01 09:35:12,546 INFO received SIGCLD indicating a child quit
2013-09-01 09:35:15,555 INFO spawned: 'dev_celerycam' with pid 25504
2013-09-01 09:35:16,540 INFO exited: dev_celerycam (exit status 1; not expected)
2013-09-01 09:35:16,540 INFO received SIGCLD indicating a child quit
2013-09-01 09:35:17,542 INFO gave up: dev_celerycam entered FATAL state, too many start retries too quickly
This occurs for either dev_celerycam or dev1_celerycam on supervisord restart: one of them starts fine while the other fails, seemingly at random.
Is there any way to get both celerycam processes working?
Both celerycam processes were somehow creating their pid file at the same path. I had to add the --pidfile parameter for each of the celerycam processes.
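For illustration, the change amounts to something like the following (only the command lines are shown; the other settings stay as they were). The pid file locations here are made up, and any two distinct writable paths would do:
[program:dev1_celerycam]
command = /usr/bin/python2.7 /var/www/dev1.example.com/manage.py celerycam --logfile=/var/log/supervisor/dev1_celerycam.log --workdir=/var/www/dev1.example.com --pidfile=/var/run/celery/dev1_celerycam.pid
[program:dev_celerycam]
command = /usr/bin/python2.7 /var/www/dev.example.com/manage.py celerycam --logfile=/var/log/supervisor/dev_celerycam.log --workdir=/var/www/dev.example.com --pidfile=/var/run/celery/dev_celerycam.pid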

sun gridengine error "shepherd of job 119232.1 exited with exit status = 26"

We use gridengine (exactly: open grid scheduler 2011.11.p1) as our batch-queuing system. I just added an execd host named host094, but when jobs were submitted there, errors were issued and the job status is Eqw. The log in $SGE_ROOT/default/spool/host094/messages says:
shepherd of job 119232.1 exited with exit status = 26
can't open usage file active_jobs/119232.1/usage for job 119232.1: No such file or directory
What does this mean?