When a process can't start at the beginning, supervisorctl update will hang - supervisord

In the case where a process can't start at the beginning, e.g. the binary defined in the conf doesn't exist or the conf has an error, the program never reaches the RUNNING state and stays in the BACKOFF state:
ex_sample_1.1.1.0.223817.486782 BACKOFF can't find command '/home/admin/ex_sample_1.1.1.0.223817.486782/ex_sample'
Then, if we remove the supervisor conf and run sudo supervisorctl update, the command hangs forever.
I have read the supervisor code but am still a little confused; maybe I haven't caught the point yet. When I run sudo supervisorctl update after removing the conf of the BACKOFF program, supervisorctl calls stopProcessGroup and sends this RPC call to the supervisord HTTP server:
results = supervisor.stopProcessGroup(gname)
and it hangs. Why does stopProcessGroup have no timeout? When the RPC call reaches the supervisord service, supervisord calls the killall function. I thought it should be called once; if the killall function failed, it would return to supervisorctl, so why does it hang here?
killall = make_allfunc(processes, isRunning, self.stopProcess,
                       wait=wait)
killall.delay = 0.05
killall.rpcinterface = self
return killall # deferred
In the supervisord log, I found that the killall function is called repeatedly and never returns to supervisorctl, which is why supervisorctl update hangs, but I couldn't find the code that calls the killall function repeatedly. How and why is the killall function called repeatedly, and how can I solve it?
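I couldn't find that loop myself, so here is a minimal sketch of the pattern I suspect is at work (my own illustration, not supervisor's actual code): the returned killall is a deferred, and I assume the async HTTP layer keeps re-invoking it until it returns something other than NOT_DONE_YET, which would explain the repeated calls in the log and why supervisorctl (which seems to set no client-side timeout) never gets a response while a process is stuck in BACKOFF.
# Hypothetical sketch (not supervisor's actual code) of how a deferred result
# could be polled: the callable is re-invoked until it stops returning a
# NOT_DONE_YET sentinel, and only then is the XML-RPC response produced.
import time

NOT_DONE_YET = object()

def poll_deferred(deferred, delay=0.05):
    while True:
        result = deferred()          # killall would be re-called here on each pass
        if result is not NOT_DONE_YET:
            return result            # the RPC response can finally be sent
        time.sleep(delay)            # wait roughly deferred.delay and try again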

Related

Kubernetes microk8s dashboard-proxy timed out waiting for the condition on deployments/kubernetes-dashboard

I am having a timeout issue while starting up my cluster dashboard with:
microk8s dashboard-proxy
This issue occurs again and again in my Kubernetes cluster. I don't really know the cause of it.
error: timed out waiting for the condition on deployments/kubernetes-dashboard
Traceback (most recent call last):
File "/snap/microk8s/1609/scripts/wrappers/dashboard-proxy.py", line 50, in <module>
dashboard_proxy()
File "/snap/microk8s/1609/scripts/wrappers/dashboard-proxy.py", line 13, in dashboard_proxy
check_output(command)
File "/snap/microk8s/1609/usr/lib/python3.5/subprocess.py", line 626, in check_output
**kwargs).stdout
File "/snap/microk8s/1609/usr/lib/python3.5/subprocess.py", line 708, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['microk8s.kubectl', '-n', 'kube-system', 'wait', '--timeout=240s', 'deployment', 'kubernetes-dashboard', '--for', 'condition=available']' returned non-zero exit status 1
I am running a Kubernetes cluster on Vagrant machines (CentOS) using microk8s. This issue can be caused by many reasons. Below are some of them:
Lack of memory
To fix: you need to increase the memory in your Vagrant machine settings.
Something went wrong during your microk8s installation
To fix: remove microk8s and reinstall it. In my case I used snap to install it. This is how I did it:
snap remove microk8s --purge
snap install microk8s --classic --channel=1.18/stable
Sometimes you need to kill the process and restart your Vagrant machine. In my case I did:
lsof -Pi:10443 (where 10443 is the port on which my dashboard is running)
kill -9 xxxx (where xxxx is the PID retrieved from the previous command)

Celery loses worker

I use Celery version 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the Celery project loses a worker and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost human_status(exitcode)), billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?
WorkerLostError is almost like an OutOfMemory error - it can't really be solved and will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not - not all tasks can be idempotent, for example. Celery also still has bugs in the way it handles WorkerLostError. Therefore you need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed: was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? All these circumstances can be handled one way or another...
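As a minimal sketch of that advice (the app name, broker URL and task below are placeholders, and this assumes Celery 4.x), a task can be acknowledged late and re-queued when its worker dies, so a crash leads to a retry instead of a silently lost task; the task body itself must still be safe to run more than once:
from celery import Celery

app = Celery('example', broker='redis://localhost:6379/0')   # placeholder app and broker

@app.task(bind=True,
          acks_late=True,                 # ack only after the task finishes
          reject_on_worker_lost=True,     # re-queue the task if the worker process dies
          autoretry_for=(Exception,),     # also retry ordinary exceptions
          max_retries=3,
          default_retry_delay=5)
def classify(self, features):
    # must be idempotent: running it twice should leave the same end state
    ...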

Celery task STARTED permanently (not retried)

We use a Celery worker in a Docker instance. If the Docker instance is killed (the container could be changed and brought back up) we need to retry the task. My task currently looks like this:
@app.task(bind=True, max_retries=3, default_retry_delay=5, task_acks_late=True)
def build(self, config, import_data):
    build_chain = chain(
        build_dataset_docstore.s(config, import_data),
        build_index.s(),
        assemble_bundle.s()
    ).on_error(handle_chain_error.s())
    return build_chain

@app.task(bind=True, max_retries=3, default_retry_delay=5, task_acks_late=True)
def build_dataset_docstore(self, config, import_data):
    # do lots of stuff

@app.task(bind=True, max_retries=3, default_retry_delay=5, task_acks_late=True)
def build_index(self, config, import_data):
    # do lots of stuff

@app.task(bind=True, max_retries=3, default_retry_delay=5, task_acks_late=True)
def assemble_bundle(self, config, import_data):
    # do lots of stuff
To imitate the container being restarted (worker being killed) I'm running the following script:
SLEEP_FOR=1
echo "-> Killing worker"
docker-compose -f docker/docker-compose-dev.yml kill worker
echo "-> Waiting $SLEEP_FOR seconds"
sleep $SLEEP_FOR
echo "-> Bringing worker back to life"
docker-compose -f docker/docker-compose-dev.yml start worker
Looking in Flower I see the task is STARTED... cool, but...
Why isn't it retried?
Do I need to handle this circumstance manually?
If so, what is the correct way to do this?
EDIT:
from the docs:
If the worker won’t shutdown after considerate time, for being stuck in an infinite-loop or similar, you can use the KILL signal to force terminate the worker: but be aware that currently executing tasks will be lost (i.e., unless the tasks have the acks_late option set).
I'm using the acks_late option, so why isn't this retrying?
The issue here seems to be task_acks_late (https://docs.celeryproject.org/en/latest/userguide/configuration.html#task-acks-late), which I assume is a setting for the Celery app rather than an option on the task.
I changed task_acks_late to acks_late, added reject_on_worker_lost, and this now functions as expected.
Thus:
@app.task(bind=True, max_retries=3, default_retry_delay=5, acks_late=True, reject_on_worker_lost=True)
def assemble_bundle(self, config, import_data):
    # do lots of stuff
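For reference, a sketch of the app-wide equivalents (the setting names task_acks_late and task_reject_on_worker_lost apply to every task registered on the app; the app name below is made up):
from celery import Celery

app = Celery('builder')                       # hypothetical app name

# app-level equivalents of the per-task acks_late / reject_on_worker_lost options
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True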

Issues with running two instances of searchd

I have just updated our Sphinx server from 1.10-beta to 2.0.6-release, and now I have run into some issues with searchd. Previously we were able to run two instances of searchd next to each other by specifying two different config files, i.e.:
searchd --config /etc/sphinx/sphinx.conf
searchd --config /etc/sphinx/sphinx.staging.conf
sphinx.conf listens to 9306:mysql41, and 9312, while sphinx.staging.conf listens to 9307:mysql41 and 9313.
After we updated to 2.0.6, however, a second instance is never started. Or rather: the output makes it seem like it starts, and a pid-file is created etc., but for some reason only the first searchd instance keeps running, and the second seems to shut down right away. So trying to run searchd --config /etc/sphinx/sphinx.conf twice (if that was the first one started) complains that the pid-file is in use, while trying to run searchd --config /etc/sphinx/sphinx.staging.conf (if that is the second started instance) "starts" the daemon again and again, only no new process is ever created.
Note that if I switch these commands around when first creating the processes, then sphinx.conf is the instance that is not really started.
I have checked, and rechecked, that these ports are only used by searchd.
Does anyone have any idea of what I can do/try next? I've installed it from source on Ubuntu 10.04 LTS with:
./configure --prefix /etc/sphinx --with-mysql --enable-id64 --with-libstemmer
make -j4 install
Note to self: Check the logs!
RT-indices use binary logs to enable crash recovery. Since my old config files did not specify a path for where these should be stored, both instances of searchd tried to write to the same binary logs. The instance started last was of course not permitted to manipulate these files, and thus exited with a fatal error:
[Fri Nov 2 17:13:32.262 2012] [ 5346] FATAL: failed to lock
'/etc/sphinx/var/data/binlog.lock': 11 'Resource temporarily unavailable'
[Fri Nov 2 17:13:32.264 2012] [ 5345] Child process 5346 has been finished,
exit code 1. Watchdog finishes also. Good bye!
The solution was simple: make sure to specify a binlog_path inside the searchd configuration section of each configuration file:
searchd
{
[...]
binlog_path = /path/to/writable/directory
[...]
}

Can a sleeping Perl program be killed using kill(SIGKILL, $$)?

I am running a Perl program. There is a module in the program which is triggered by an external process to kill all the child processes and terminate its execution.
This works fine.
But when a certain function, say xyz(), is executing, there is a sleep(60) statement on a condition.
Right now the function is executed repeatedly as it is waiting for some value.
When I trigger the kill process as mentioned above, the kill does not take place.
Does anybody have a clue as to why this is happening?
I don't understand how you are trying to kill a process from within itself (your $$ in the question title) when it's sleeping.
If you are killing from a DIFFERENT process, then that process has its own $$. You first need to find out the PID of the original process to kill (by trawling the process list or by somehow having the original process communicate it).
Killing a sleeping process works very well:
$ ( date ; perl5.8 -e 'sleep(100);' ; date ) &
Wed Sep 14 09:48:29 EDT 2011
$ kill -KILL 8897
Wed Sep 14 09:48:54 EDT 2011
This also works with other "killish" signals ('INT', 'ABRT', 'QUIT', 'TERM')
UPDATE: Upon re-reading, maybe the issue you meant is that the "triggered by an external process" part doesn't happen. If that's the case, you need to:
Set up a CATCHABLE signal handler in your process before going to sleep ($SIG{'INT'}) - SIGKILL cannot be caught by a handler.
Send SIGINT from said "external process"
Do all the needed cleanup in the SIGINT handler once sleep() is interrupted by SIGINT.