Supervisord: Don't restart subprocesses on failure

How do I make supervisord leave one of its managed subprocesses stopped (rather than restarting it) when that subprocess stops or crashes?
Regards.

The solution was to add
autorestart=false
to the relevant subprocess's [program:x] section.
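For context, a minimal sketch of what such a stanza might look like in supervisord.conf (the program name and command below are placeholders, not from the original question):
[program:x]
; placeholder command for the managed subprocess
command=/usr/local/bin/run-x
; start the subprocess together with supervisord
autostart=true
; leave it stopped when it exits or crashes instead of restarting it
autorestart=false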


Celery - handle WorkerLostError exception with Task.retry()

I'm using celery 4.4.7
Some of my tasks are using too much memory and are getting killed with signal 9 (SIGKILL). I would like to retry them later, since I'm running with concurrency on the machine and they might run OK again.
However, as far as I understand, you can't catch a WorkerLostError exception thrown within a task, i.e. this won't work as I expect:
from billiard.exceptions import WorkerLostError

@celery_app.task(acks_late=True, max_retries=2, autoretry_for=(WorkerLostError,))
def some_task():
    # task code
    ...
I also don't want to use task_reject_on_worker_lost, as it makes the tasks get requeued and max_retries is not applied.
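(For reference, this is the setting I'm ruling out; with standard Celery configuration it would be enabled like below, but a task rejected this way is just requeued and the requeue doesn't count against max_retries.)
# The option being ruled out: on worker loss the task is requeued,
# but the requeue does not count against max_retries.
celery_app.conf.task_acks_late = True
celery_app.conf.task_reject_on_worker_lost = True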
What would be the best approach to handle my use case?
Thanks in advance for your time :)
Gal

boofuzz - Target connection reset, skip error

I am using boofuzz to try to fuzz a specific application. While creating the blocks and doing some testing, I noticed that the target sometimes closes the connection. This causes procmon to terminate the target process and restart it. However, this is totally unnecessary for this target.
Can I somehow tell boofuzz not to handle this as an error (so the target is not restarted)?
[2017-11-04 17:09:07,012] Info: Receiving...
[2017-11-04 17:09:07,093] Check Failed: Target connection reset.
[2017-11-04 17:09:07,093] Test Step: Calling post_send function:
[2017-11-04 17:09:07,093] Info: No post_send callback registered.
[2017-11-04 17:09:07,093] Test Step: Sleep between tests.
[2017-11-04 17:09:07,094] Info: sleeping for 0.100000 seconds
[2017-11-04 17:09:07,194] Test Step: Contact process monitor
[2017-11-04 17:09:07,194] Check: procmon.post_send()
[2017-11-04 17:09:07,196] Check OK: No crash detected.
Excellent question! There isn't (wasn't) any way to do this, but there really should be. A reset connection does not always mean a failure.
I just added ignore_connection_reset and ignore_connection_aborted options to the Session class to ignore ECONNRESET and ECONNABORTED errors respectively. Available in version 0.0.10.
Description of arguments available in the docs: http://boofuzz.readthedocs.io/en/latest/source/Session.html
You may find the commit that added these arguments informative for how some of the boofuzz internals work (relevant lines 182-183, 213-214, 741-756): https://github.com/jtpereyda/boofuzz/commit/a1f08837c755578e80f36fd1d78401f21ccbf852
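As a rough sketch of how those options are passed (the target host, port, and protocol below are placeholders, not taken from the question):
from boofuzz import Session, SocketConnection, Target

# Treat ECONNRESET / ECONNABORTED as non-failures instead of flagging
# "Check Failed: Target connection reset." and restarting the target.
session = Session(
    target=Target(connection=SocketConnection("127.0.0.1", 9999, proto="tcp")),
    ignore_connection_reset=True,
    ignore_connection_aborted=True,
)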
Thank you for the solid question.

specman error - Specman run reached the tick_max configuration limit

I am running a specman environment and I get the error:
"Specman run reached the tick_max configuration limit(10000) without a call to stop_run() "
How can I debug this error?
I guess I have some kind of endless loop with no time progress. How can I find the place (file and line number) where the simulator is stuck?
I have tried to use break on error - but it does nothing...
Thanks
In addition to the previous comment about the simulation progressing, you may want to try increasing the number of ticks with the following command:
config run -tick_max=MAX_INT
and then hitting the interrupt button (or Ctrl+C) when the simulation stops progressing, and checking the stack in the Specman debugger.

Titan server start fails

I see the following message when starting titan-server:
Caused by: InvalidRequestException(why:Keyspace names must be case-insensitively unique ("titan" conflicts with "titan"))
at org.apache.cassandra.thrift.Cassandra$system_add_keyspace_result.read(Cassandra.java:33158)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_system_add_keyspace(Cassandra.java:1408)
at org.apache.cassandra.thrift.Cassandra$Client.system_add_keyspace(Cassandra.java:1395)
at com.netflix.astyanax.thrift.ThriftClusterImpl$9.internalExecute(ThriftClusterImpl.java:250)
at com.netflix.astyanax.thrift.ThriftClusterImpl$9.internalExecute(ThriftClusterImpl.java:247)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
... 26 more
Here is how I start it.
~/titan-server-0.4.0$ bin/titan.sh -c cassandra-es start
What am I missing? Thanks for any help.
Of course. After I run
titan.sh -c cassandra-es clean
it starts up just fine. Does that mean there was something wrong with my data?
You can directly start Titan with
bin/titan.sh start

What am I doing wrong for Sphinx to fail to start during cap deploy?

I'm struggling to get Sphinx back up and running after deploying a rails app to my VPS.
Specifically, I'm thrown this error:
** [out :: myapp.com] => Mixing in Lockdown version: 1.6.4
** [out :: myapp.com]
** [out :: myapp.com] Failed to start searchd daemon. Check /var/www/myapp/releases/20100227224936/log/searchd.log.
** [out :: myapp.com] Failed to start searchd daemon. Check /var/www/myapp/releases/20100227224936/log/searchd.log
However, a log file isn't created!
This is the deploy.rb I am using (with thanks to Updrift :) )
namespace :deploy do
  desc "Restart the app"
  task :restart, :roles => :app do
    # This regen's the config file, stops Sphinx if running, then starts it.
    # No indexing is done, just a restart of the searchd daemon
    # thinking_sphinx.running_start

    # The above does not re-index. If any of your define_index blocks
    # in your models have changed, you will need to perform an index.
    # If these are changing frequently, you can use the following
    # in place of running_start
    thinking_sphinx.stop
    thinking_sphinx.index
    thinking_sphinx.start

    # Restart the app
    run "touch #{current_path}/tmp/restart.txt"
  end

  desc "Cleanup older revisions"
  task :after_deploy do
    cleanup
  end
end
I'm using the Thinking Sphinx gem, v 1.3.16, passenger 2.2.10. Any thoughts you have would be greatly appreciated.
Many thanks!
Greg
UPDATE: Further to some more Google searching, I've found a couple of other people with similar errors, seemingly related to port listening problems, e.g. here and [I'm not allowed to link to the other one]. My production.sphinx.conf similarly uses port 9312, despite my specifying in sphinx.yml that it should use 3312.
Does anyone have any idea what might be causing this? Thanks.
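(For reference, the searchd port in Thinking Sphinx is normally set per environment in config/sphinx.yml, roughly like this; a sketch, not my actual file:)
production:
  # searchd should listen on 3312, yet the generated production.sphinx.conf shows 9312
  port: 3312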
I should have rung up the IT Crowd: "Have you tried turning it off and on again?"
http://groups.google.com/group/thinking-sphinx/browse_thread/thread/dde565ea40e31075
Rebooting the server released the port.