Parallel h5py - The MPI_Comm_dup() function was called before MPI_INIT was invoked - h5py

I am experiencing the issue below with Parallel h5py on macOS Ventura 13.0.1, on a 2020 MacBook Pro with a 4-core Intel Core i7.
I installed h5py and its dependencies by following both these docs and this guide.
A job which requires only mpi4py runs and finishes without any issues. The problem comes when I try to run a job which requires Parallel h5py, e.g. trying out this code.
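For reference, the linked code is essentially the usual minimal collective-write example, roughly along these lines (the file name is just a placeholder), launched with something like mpiexec -n 4 python demo.py:
# Minimal Parallel h5py sketch: every MPI rank writes one element of a shared dataset.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Open the file collectively with the MPI-IO driver.
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=comm)
dset = f.create_dataset('test', (comm.Get_size(),), dtype='i')
dset[rank] = rank  # each rank writes its own slot
f.close()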
I get back the following:
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_dup() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[...] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20469,1],3]
Exit code: 1
--------------------------------------------------------------------------
I found this GitHub issue, but it didn't help in my case.
I should also point out that I managed to install and use Parallel h5py on a MacBook Air running macOS Monterey, though that one is only dual-core, so it doesn't let me test Parallel h5py with as many cores without using -overcommit.
Since I have not found any ideas on how to resolve this apart from the above GitHub issue, I would appreciate any suggestions.

Related

Celery - handle WorkerLostError exception with Task.retry()

I'm using celery 4.4.7.
Some of my tasks are using too much memory and are getting killed with signal 9 (SIGKILL). I would like to retry them later, since I'm running with concurrency on the machine and they might run fine on another attempt.
However, as far as I understand, you can't catch a WorkerLostError exception thrown within a task, i.e. this won't work as I expect:
from billiard.exceptions import WorkerLostError

@celery_app.task(acks_late=True, max_retries=2, autoretry_for=(WorkerLostError,))
def some_task():
    # task code
I also don't want to use task_reject_on_worker_lost, as it makes the tasks get requeued and max_retries is not applied.
What would be the best approach to handle my use case?
Thanks in advance for your time :)
Gal

Celery lose worker

I use celery version 4.4.0 in my project (Ubuntu 18.04.2 LTS). When I raise Exception('too few functions in features to classify'), the celery project loses a worker and I get logs like these:
[2020-02-11 15:42:07,364] [ERROR] [Main ] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.')
Traceback (most recent call last):
  File "/var/lib/virtualenvs/simus_classifier_new/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
[2020-02-11 15:42:07,474] [DEBUG] [ForkPoolWorker-61] Closed channel #1
Do you have any idea how to solve this problem?
WorkerLostError is almost like an OutOfMemory error: it can't really be solved, and it will continue to happen from time to time. What you should do is make your task(s) idempotent and let Celery retry tasks that failed due to a worker crash.
It sounds trivial, but in many cases it is not. Not all tasks can be idempotent, for example, and Celery still has bugs in the way it handles WorkerLostError. Therefore you need to monitor your Celery cluster closely, react to these events, and try to minimize them. In other words, find out why the worker crashed: was it killed by the system because it was consuming all the memory? Was it killed simply because it was running on an AWS spot instance that got terminated? Was it killed by someone executing kill -9 <worker pid>? Each of these circumstances has to be handled in its own way...
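For example, here is a rough sketch of that idea, assuming acks_late together with reject_on_worker_lost and a durable "done" marker; the broker URL and the marker helpers are placeholders you would implement against your own store, and redelivery this way is not bounded by max_retries, so you would have to cap attempts yourself:
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker URL

def already_processed(item_id):
    # Placeholder: check a durable store (DB, Redis, ...) for a completion marker.
    return False

def mark_processed(item_id):
    # Placeholder: record the completion marker in the same store.
    pass

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def some_task(self, item_id):
    # acks_late + reject_on_worker_lost means the broker redelivers the message
    # if the worker process dies mid-task; idempotency makes that redelivery safe.
    if already_processed(item_id):
        return
    # ... do the real work here ...
    mark_processed(item_id)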

boofuzz - Target connection reset, skip error

I am using boofuzz to try to fuzz a specific application. While creating the blocks etc. and doing some testing, I noticed that the target sometimes closes the connection. This causes procmon to terminate the target process and restart it. However, this is totally unnecessary for this target.
Can I somehow tell boofuzz not to handle this as an error (so the target is not restarted)?
[2017-11-04 17:09:07,012] Info: Receiving...
[2017-11-04 17:09:07,093] Check Failed: Target connection reset.
[2017-11-04 17:09:07,093] Test Step: Calling post_send function:
[2017-11-04 17:09:07,093] Info: No post_send callback registered.
[2017-11-04 17:09:07,093] Test Step: Sleep between tests.
[2017-11-04 17:09:07,094] Info: sleeping for 0.100000 seconds
[2017-11-04 17:09:07,194] Test Step: Contact process monitor
[2017-11-04 17:09:07,194] Check: procmon.post_send()
[2017-11-04 17:09:07,196] Check OK: No crash detected.
Excellent question! There isn't (wasn't) any way to do this, but there really should be. A reset connection does not always mean a failure.
I just added ignore_connection_reset and ignore_connection_aborted options to the Session class to ignore ECONNRESET and ECONNABORTED errors respectively. Available in version 0.0.10.
Description of arguments available in the docs: http://boofuzz.readthedocs.io/en/latest/source/Session.html
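For example, something along these lines (the target host/port and the SocketConnection setup here are just illustrative; adapt them to however you already build your Session):
from boofuzz import Session, Target, SocketConnection

session = Session(
    target=Target(connection=SocketConnection("127.0.0.1", 9999, proto="tcp")),
    ignore_connection_reset=True,    # don't treat ECONNRESET as a failure
    ignore_connection_aborted=True,  # don't treat ECONNABORTED as a failure
)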
You may find the commit that added these arguments informative for how some of the boofuzz internals work (relevant lines 182-183, 213-214, 741-756): https://github.com/jtpereyda/boofuzz/commit/a1f08837c755578e80f36fd1d78401f21ccbf852
Thank you for the solid question.

Starting Parpool in MATLAB

I tried starting parpool in MATLAB 2015b. The command is as follows:
parpool('local',3);
This command should allocate 3 workers, but instead I received an error stating that the parallel pool failed to start. The error message is as follows:
Error using parpool (line 94)
Failed to start a parallel pool. (For information in addition to
the causing error, validate the profile 'local' in the Cluster Profile
Manager.)
A similar query was posted at https://nl.mathworks.com/matlabcentral/answers/196549-failed-to-start-a-parallel-pool-in-matlab2015a, and I followed the same procedure to validate the local profile as per the suggestions.
Using distcomp.feature('LocalUseMpiexec', false); or distcomp.feature('LocalUseMpiexec', true) in startup.m didn't lead to any improvement. Attempting to validate the local profile still gives the following error report:
VALIDATION DETAILS
Profile: local
Scheduler Type: Local
Stage: Cluster connection test (parcluster)
Status: Passed
Description:Validation Passed
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Job test (createJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 24 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job24.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: SPMD job test (createCommunicatingJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 25 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job25.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: Pool job test (createCommunicatingJob)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Parallel pool test (parpool)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
I am receiving these errors only on my cluster machine; launching parpool on my standalone PC works perfectly. Is there a way to rectify this issue?

Can a sleeping Perl program be killed using kill(SIGKILL, $$)?

I am running a Perl program; there is a module in the program which is triggered by an external process to kill all the child processes and terminate its execution.
This works fine.
But when a certain function, say xyz(), is executing, there is a sleep(60) statement that runs on a condition.
Right now the function is executed repeatedly, as it is waiting for some value.
When I trigger the kill process as mentioned above, the kill does not take place.
Does anybody have a clue as to why this is happening?
I don't understand how you are trying to kill a process from within itself (the $$ in your question's subject) when it's sleeping.
If you are killing from a DIFFERENT process, then it will have its own $$. You first need to find out the PID of the original process you want to kill (by trawling the process list or by somehow communicating it from the original process).
Killing a sleeping process works very well:
$ ( date ; perl5.8 -e 'sleep(100);' ; date ) &
Wed Sep 14 09:48:29 EDT 2011
$ kill -KILL 8897
Wed Sep 14 09:48:54 EDT 2011
This also works with other "killish" signals ('INT', 'ABRT', 'QUIT', 'TERM')
UPDATE: Upon re-reading, maybe the issue you meant was that the "triggered by an external process" part doesn't happen. If that's the case, you need to:
Set up a CATCHABLE signal handler in your process before going to sleep ($SIG{'INT'}); SIGKILL cannot be caught by a handler.
Send SIGINT from said "external process".
Do all the needed cleanup in the SIGINT handler once sleep() is interrupted by SIGINT.