Can a sleeping Perl program be killed using kill(SIGKILL, $$)? - perl

I am running a Perl program that contains a module which, when triggered by an external process, kills all the child processes and terminates its own execution.
This works fine.
But a certain function, say xyz(), contains a sleep(60) statement that runs on a condition.
Right now that function executes repeatedly, as it is waiting for some value.
When I trigger the kill as described above while the program is sleeping, the kill does not take effect.
Does anybody have a clue as to why this is happening?

I don't understand how you are trying to kill a process from within itself (the $$ in your question's subject) while it's sleeping.
If you are killing it from a DIFFERENT process, then that process has its own $$. You first need to find out the PID of the original process (by trawling the process list, or by having the original process communicate its PID somehow, e.g. via a pidfile, as sketched below).
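For example, a minimal sketch of that pidfile variant (the path /tmp/myprog.pid is just an illustration):
# In the original (to-be-killed) process: record our PID at startup.
open my $pidfile, '>', '/tmp/myprog.pid' or die "cannot write pidfile: $!";
print $pidfile "$$\n";
close $pidfile;
# In the external process: read that PID back and signal it.
open my $in, '<', '/tmp/myprog.pid' or die "cannot read pidfile: $!";
chomp(my $target = <$in>);
close $in;
kill 'KILL', $target;    # SIGKILL, matching the kill(SIGKILL, ...) in the title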
Killing a sleeping process works very well (8897 below being the PID of the sleeping perl process):
$ ( date ; perl5.8 -e 'sleep(100);' ; date ) &
Wed Sep 14 09:48:29 EDT 2011
$ kill -KILL 8897
Wed Sep 14 09:48:54 EDT 2011
This also works with the other "killish" signals ('INT', 'ABRT', 'QUIT', 'TERM').
UPDATE: Upon re-reading, maybe the issue you meant is that the "triggered by an external process" part doesn't happen. If that's the case, you need to (see the sketch after this list):
1. Set up a CATCHABLE signal handler in your process before going to sleep (e.g. $SIG{'INT'}); SIGKILL cannot be caught by a handler.
2. Send SIGINT from said "external process".
3. Do all the needed cleanup in the SIGINT handler, or right after sleep() is interrupted by SIGINT.
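A minimal sketch of those three steps, assuming the program tracks its children in a variable like @child_pids (a hypothetical name):
use strict;
use warnings;
my @child_pids;        # populated wherever the program forks its children
my $got_sigint = 0;
# Step 1: install a catchable handler before going to sleep.
$SIG{INT} = sub { $got_sigint = 1 };
# ... inside xyz() ...
sleep 60;              # an incoming SIGINT makes this sleep() return early
# Step 3: clean up once sleep() has been interrupted.
if ($got_sigint) {
    kill 'TERM', @child_pids if @child_pids;
    exit 0;
}
Step 2 is then simply a kill -INT <pid of the sleeping program> from the external process.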

Related

Parallel h5py - The MPI_Comm_dup() function was called before MPI_INIT was invoked

I am experiencing the issue below with Parallel h5py on macOS Ventura 13.0.1, on a 2020 MacBook Pro with a 4-core Intel Core i7.
I installed h5py and its dependencies by following both of these docs and this guide.
A job which requires only mpi4py runs and finishes without any issues. The problem comes when I try to run a job which requires Parallel h5py, e.g. trying out this code.
I get back the following:
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(the four lines above are repeated once for each of the four MPI ranks)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20469,1],3]
Exit code: 1
--------------------------------------------------------------------------
I found this GitHub issue, but it didn't help in my case.
I should also point out that I managed to install and use Parallel h5py on a MacBook Air running macOS Monterey, though that machine is only dual-core, so it doesn't let me test Parallel h5py with as many cores without using -overcommit.
Since I have not found any ideas on how to resolve this, apart from the above GitHub issue, I would appreciate any suggestions.

Celery lose worker

I use celery version 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the project loses the celery worker and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost human_status(exitcode)), billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?
WorkerLostError is almost like an OutOfMemory error: it can't really be "solved", and it will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not; not all tasks can be idempotent, for example. Celery also still has bugs in the way it handles WorkerLostError. You therefore need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed: was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? All these circumstances can be handled one way or another.

Solaris svcs command shows wrong status

I have freshly installed an application on Solaris 5.10. When I check with ps -ef | grep hyperic | grep agent, the processes are up and running. When I check the status with the svcs hyperic-agent command, the output shows that the agent is in maintenance mode. The application is working fine and I don't have any issues with it. Please help.
There are several reasons that can lead to this behavior:
The starter (the start/exec property of the service) returned a status different from SMF_EXIT_OK (zero). Then you may check the logs:
# svcs -x ssh
...
See: /var/svc/log/network-ssh:default.log
If you check the logs, you may see messages like the following, which mean the starter script failed or is written incorrectly:
[ Aug 11 18:40:30 Method "start" exited with status 96 ]
Another reason for such behavior is that the service faults while it is working (i.e. one of its processes dumps core or receives a kill signal, or all of its processes exit), as described here: https://blogs.oracle.com/lianep/entry/smf_5_fault_retry_models
The actual subsystem that provides these SMF monitoring facilities is System Contracts. You can determine the contract ID of an online service with svcs -v (the CTID field):
# svcs -vp svc:/network/smtp:sendmail
STATE NSTATE STIME CTID FMRI
online - Apr_14 68 svc:/network/smtp:sendmail
Apr_14 1679 sendmail
Apr_14 1681 sendmail
Then watch its events with ctwatch:
# ctwatch 68
CTID EVID CRIT ACK CTTYPE SUMMARY
68 28 crit no process contract empty
Then there are two options to handle this:
1. There is a real problem with the service, so it eventually faults. In that case, debug the application.
2. It is normal behavior of the service, so you should edit and re-import your service manifest to make SMF less paranoid, i.e. configure the ignore_error and duration properties (see the sketch below).
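A minimal sketch of that second option using svccfg (the hyperic-agent FMRI is taken from the question; core and signal are the legal ignore_error values, and the addpg step is only needed if the startd property group does not exist yet):
# svccfg -s hyperic-agent addpg startd framework
# svccfg -s hyperic-agent setprop startd/ignore_error = astring: "core,signal"
# svcadm refresh hyperic-agent
# svcadm clear hyperic-agent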

Why doesn't my Perl script start under FastCGI?

I am getting this error in the error_log of one of my Perl CGI applications. I am pretty sure I haven't changed the script at all, and all of a sudden I have started getting this error.
This is what I see in error_log:
[Wed Jul 8 15:18:20 2009] [warn] FastCGI: server "/local/web/test/cgi-bin/test.pl" (pid 17033)
terminated by calling exit with status '255'
[Wed Jul 8 15:18:20 2009] [warn] FastCGI: server "/local/web/test/cgi-bin/test.pl"
has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds
(The snippet was edited for clarity)
Also, the AddHandler line for FastCGI exists in the config file.
Can any tell me the possible reasons for this error?
There is nothing recorded in Apache logs.
Here are two tips that might help (assuming your app respects the FastCGI protocol):
1. Try to run the app on the command line; this proves that the execute bit is set and that there are no compile errors in the code.
2. Check the suexec.log of your Apache server; this might show user/group or other errors related to your script.
You could try redirecting STDERR from your Perl script, something like:
BEGIN { open STDERR, '>', 'stderr.log' }
If the stderr.log file doesn't get created at all, it means the script didn't even get executed, which likely points to a suexec/permissions problem. Otherwise, the file should contain whatever went wrong with the Perl script.

How to get the PID of a process which has sent a SIGABRT signal to another process which has exited dumping core

How can I get the PID of the process that sent a SIGABRT signal to another process, which then exited dumping core?
In short: by installing a signal handler for SIGABRT. More specifically, if you specify the SA_SIGINFO flag when installing the handler, the siginfo_t structure will be populated with extra information about the signal, including the sender's PID.
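A minimal Perl sketch of that approach, assuming a perl and platform where POSIX sigaction supports SA_SIGINFO (there, the handler's second argument is a hash reference that carries the sender's pid):
use strict;
use warnings;
use POSIX;    # sigaction, SIGABRT, SA_SIGINFO
sub abrt_handler {
    my ($sig, $info) = @_;
    # With SA_SIGINFO, $info is a hash reference whose 'pid' entry
    # identifies the sending process (where the platform provides it).
    print STDERR "caught SIG$sig from pid $info->{pid}\n" if ref $info;
    # Restore the default disposition and re-raise, so the process
    # still aborts and dumps core as it did before.
    $SIG{ABRT} = 'DEFAULT';
    kill 'ABRT', $$;
}
my $action = POSIX::SigAction->new(
    \&abrt_handler,
    POSIX::SigSet->new,    # block no additional signals in the handler
    SA_SIGINFO,
);
sigaction(SIGABRT, $action) or die "sigaction: $!";
sleep 100;    # wait here; send SIGABRT from another process to test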