When using pycaffe to run solver.solve(), only one iteration is executed, then the current process is killed - neural-network

I am using pycaffe to do a multilabel classification task. When I run solver.slove() or solver.step(2), only one iteration is executed, then the current process is killed somehow. ipython console says the kernel died unexpectedly. No other error information is provided.
Then, I use terminal to run the command "python Test.py", and get the "Floating point exception (core dumped)" information.
Besides, the net.forward() and net.backward() methods are all ok.
What is the reason? And how to solve the problem?

Related

Manually Stop Jupyter Kernel and Prevent from Restarting

Background
I have created a Jupyter kernel A from which I launch another kernel B. I am doing this in order to audit the execution of kernel B. So when a user selects kernel A from the interface, kernel B is launched in the background which then executes the notebook code. strace is being used to audit the execution. After the audit phase, code, data, and provenance etc. of the program execution are recorded and stored for analysis later on.
Problem
After the notebook program ends, I intend to stop tracing the execution of kernel B. This does not happen unless I stop the execution of kernel B launched internally by kernel A. The only way I have been able to do this is using the kill command as such:
os.kill(os.getpid(), 9)
This does the job but with a side-effect: Jupyter restarts the kernel automatically which means kernel A and B are launched and start auditing the execution again. This causes certain race conditions and overwrites of some files which I want to avoid.
Possible Solution
To my mind, there are two things I can do to resolve this issue:
Exit the kernel B program gracefully so the auditing of the notebook code gets completed and stored. This does not happen with the kill command so would need some other solution
Avoid automatic restart of the kernel, with or without the kill command.
I have looked into different ways to achieve the above two but have not been successful yet. Any advice on achieving either of the above two solutions would be appreciated, or perhaps another way of solving the problem.
have you tried terminating kernel B instead of killing it
using 15 instead of 9
os.kill(os.getpid(), signal.SIGTERM)
#or
os.kill(os.getpid(), 15)
other values for kill are signal.SIGSTOP=23 , signal.SIGHUP=1
Other option could be to insert following code snippet on top of the code
import signal
import sys
def onSigExit():
sys.exit(0)
#os._exit(0)
signal.signal(signal.SIGINT, onSigExit)
signal.signal(signal.SIGTERM, onSigExit)
now you should be able to send
os.kill(os.getpid(),15)
and it should exit gracefully and not restart

dotMemory command line scheduled snapshots

I'm running dotMemory command line against an IoT Windows Forms application which requires many hours of tests on a custom appliance.
My purpose is to get memory snapshots on a time basis, while the application is running on the appliance. For example, if the test is designed to run for 24h, I want to get a 10 seconds memory snapshot each hour.
I found 2 ways of doing it:
Run dotMemory.exe and get a standalone snapshot on a time basis, by using schtasks to schedule each execution;
Run dotMemory using the attach and trigger arguments and get all the snapshots on a single file.
The first scenario it's ready for me, but as it is easy to see, the second one is much better for further analysis after collecting the data.
I'm able to start it by using a command just like:
C:\dotMemory\dotMemory.exe attach $processId --trigger-on-activation --trigger-timer=10s --trigger-max-snapshots=24 --trigger-delay=3600s --save-to-dir=c:\dotMemory\Snapshots
Here comes my problem:
How can I make the command/process stop after it reaches the max-snapshot value without any human intervention?
Reference: https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html
If you start your app under profiling instead of attaching to the already running process, stopping the profiling session will kill the app under profiling. You can stop profiling session by passing ##dotMemory["disconnect"] command to the dotMemory console stdin. (E.g. some script can do that after some time).
See dotmemory help service-messages for details
##dotMemory["disconnect"] Disconnect profiler.
If you started profiling with 'start*' commands, the profiled process will be killed.
If you started profiling with 'attach' command, the profiler will detach from the process.
P.S.
Some notes about your command line. With this comand line dotMemory will get a snapshot each 10 seconds but will start to do it after one hour. There is no such thing as "10 seconds memory snapshot" memory snapshot is a momentary snapshot of an object graph in the memory. Right command line for your task will be C:\dotMemory\dotMemory.exe attach $processId --trigger-on-activation --trigger-timer=1h --trigger-max-snapshots=24 --save-to-dir=c:\dotMemory\Snapshots

Matlab/Simulink: run batch of simulations in parallel?

I have to run a series of simulations and save the results. Since by default Matlab only uses one core, I wonder if it is possible to open multiple worker tasks and assign different simulation runs to them?
You could run each simulation in a separate MATLAB instance and let the OS handle the process to core assignment.
One master MATLAB could synchronize each child instances checking for example if simulation results file are existing.
I aso have the same problem but I did not manage to really understand how to make it in MatLab. The documentation in matlab is too advanced to get to know how to make it.
Since I am working with Ubuntu I find a way to do the work calling the unix command from MatLab and using the parallel GNU command
So I mange to run my simulation in parallel with 4 cores.
unix('parallel --progress -j4 flow > /dev/null :::: Pool.txt','-echo')
you can find more info in the link
Shell, run four processes parallel
Details of the syntaxis can be found at https://www.gnu.org/software/parallel/
but breifly I can tell you
--progress shows a status of the progress
-j4 tells the amount or jobs in parallel you want to have
flow is the name of my simulator
/dev/null was just to avoid the screen run output of the simulator to show up
Pool.txt is a file I made with the required simulator input that is basically the path and the main simulator file.
echo I do not remember now what was it for :D

Detect errors with torque and grid engine and prevent execution of dependent tasks

I have a shell script that queues multiple tasks for execution on an HPC cluster. The same job submission script works for either torque or grid engine with some minor conditional logic. This is a pipeline where the output of earlier tasks are fed to later tasks for further processing. I'm using qsub to define job dependencies, so later tasks wait for earlier tasks to complete before starting execution. So far so good.
Sometimes, a task fails. When a failure happens, I don't want any of the dependent tasks to attempt processing the output of the failed task. However, the dependent tasks have already been queued for execution long before the failure occurred. What is a good way to prevent the unwanted processing?
You can use the afterok dependency argument. For example, the qsub command may look like:
qsub -w depend=afterok:<jobid> submit.pbs
Torque will only start the next job if the jobid exits without errors. See documentation on the Adaptive Computing page.
Here is what I eventually implemented. The key to making this work is returning error code 100 on error. Sun Grid Engine stops execution of subsequent jobs upon seeing error code 100. Torque stops execution of subsequent jobs upon seeing any non-zero error code.
qsub starts a sequence of bash scripts. Each of those bash scripts has this code:
handleTrappedErrors()
{
errorCode=$?
bashCommand="$BASH_COMMAND"
scriptName=$(basename $0)
lineNumber=${BASH_LINENO[0]}
# log an error message to a log file here -- not shown
exit 100
}
trap handleErrors ERR
Torque (as Derek mentioned):
qsub -W depend=afterok:<jobid> ...
Sun Grid Engine:
qsub -hold_jid <jobid> ...

Matlab process termination in slurm

I have two questions that to me seem related:
First, is it necessary to explicitly terminate Matlab in my sbatch command? I have looked through several online slurm tutorials, and in some cases the authors include an exit command:
http://www.umbc.edu/hpcf/resources-tara-2013/how-to-run-matlab.html
And in some they don't:
http://www.buffalo.edu/ccr/support/software-resources/compilers-programming-languages/matlab/PCT.html
Second, when creating a parallel pool in a job, I almost always get the following warning:
Warning: Found 4 pre-existing communicating job(s) created by pool that are
running, and 2 communicating job(s) that are pending or queued. You can use
'delete(myCluster.Jobs)' to remove all jobs created with profile local. To
create 'myCluster' use 'myCluster = parcluster('local')'
Why is this happening, and is there any way to avoid it happening to myself and to others because of me?
It depends on how you launch Matlab. Note that your two examples use distinct methods for running a matlab script; the first one uses the -r option
matlab -nodisplay -r "matrixmultiply, exit"
while the second one uses stdin redirection from a file
matlab < runjob.m
In the first solution, the Matlab process will be left running after the script is finished, that is why the exit command is needed there. In the second solution, the Matlab process is terminated as stdin closes when the end of the file is reached.
If you do not end the matlab process, Slurm will kill it when the maximum allocation time is reached, as defined by the --time option in you submission script or by the default cluster (or partition) value.
To avoid the warning you mention, make sure to systematically use matlabpool close at the end of your job. If you have several instances of Matlab running on the same node, and you have a shared home directory, you will probably get the warning anyhow, as I believe the information about open matlab pools is stored in a hidden folder in your home. Rebooting will probably not help, but finding those files and removing them will (be careful though and ask the system administrator).
to avoid your warning, you have to delete
.matlab/local_cluster_jobs/
directory