I have a script that uses scikit-learn's parallel features (implemented by the joblib library). Typically I run it with higher verbosity, so that I can monitor the progress:
grid = GridSearchCV(estimator, params, cv=5, n_jobs=4, verbose=50)
When using a normal Python console, the messages from the Parallel library get printed to the console as they occur, like so:
[Parallel(n_jobs=4)]: Done 4 jobs | elapsed: 32.8s
[Parallel(n_jobs=4)]: Done 2 jobs | elapsed: 33.7s
However, when I'm running the script in an IPython notebook, it looks like these messages are being buffered while the job runs; they only appear after the job has completed or I terminate the kernel.
Is there any way to get it to display in real time in the notebook?
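For what it is worth, here is a minimal sketch one could use to try reproducing the behaviour with joblib alone, without GridSearchCV (the worker count, verbosity level, and sleep duration are arbitrary choices):
import time
from joblib import Parallel, delayed

# In a plain Python console the progress lines appear as each task finishes;
# in a notebook they may only show up once the whole call has returned.
Parallel(n_jobs=4, verbose=50)(delayed(time.sleep)(1) for _ in range(8))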
Background
I have created a Jupyter kernel A from which I launch another kernel B. I am doing this in order to audit the execution of kernel B. So when a user selects kernel A from the interface, kernel B is launched in the background, which then executes the notebook code. strace is being used to audit the execution. After the audit phase, the code, data, provenance, etc. of the program execution are recorded and stored for later analysis.
Problem
After the notebook program ends, I intend to stop tracing the execution of kernel B. This does not happen unless I stop the execution of kernel B, which was launched internally by kernel A. The only way I have been able to do this is with a kill call, like so:
os.kill(os.getpid(), 9)
This does the job, but with a side effect: Jupyter restarts the kernel automatically, which means kernels A and B are launched and start auditing the execution again. This causes certain race conditions and overwrites of some files, which I want to avoid.
Possible Solution
To my mind, there are two things I can do to resolve this issue:
Exit the kernel B program gracefully, so that the auditing of the notebook code gets completed and stored. This does not happen with the kill command, so it would need some other solution.
Avoid automatic restart of the kernel, with or without the kill command.
I have looked into different ways to achieve the above two but have not been successful yet. Any advice on achieving either of the above two solutions would be appreciated, or perhaps another way of solving the problem.
Have you tried terminating kernel B instead of killing it, i.e. sending signal 15 (SIGTERM) instead of 9 (SIGKILL)?
import signal
os.kill(os.getpid(), signal.SIGTERM)
# or, equivalently, by number
os.kill(os.getpid(), 15)
Other signals you could send are signal.SIGSTOP (19 on most Linux platforms) and signal.SIGHUP (1); the numeric values vary by platform, so prefer the signal constants.
Another option could be to insert the following code snippet at the top of the notebook code:
import signal
import sys

def onSigExit(signum, frame):
    # exit cleanly so any cleanup/audit code gets a chance to run
    sys.exit(0)
    # os._exit(0)  # alternative: immediate exit, skipping cleanup

signal.signal(signal.SIGINT, onSigExit)
signal.signal(signal.SIGTERM, onSigExit)
Now you should be able to send
os.kill(os.getpid(), 15)
and the kernel should exit gracefully and not restart.
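If kernel A launches kernel B itself, another option along the same lines is to send SIGTERM from kernel A's side and wait for kernel B to finish its cleanup. This is only a rough sketch under the assumption that kernel A starts kernel B via subprocess and keeps the Popen handle; the command line and timeout below are placeholders:
import signal
import subprocess

# hypothetical launch of kernel B from kernel A (adjust to your real command)
kernel_b = subprocess.Popen(["python", "-m", "ipykernel_launcher", "-f", "connection.json"])

# ... after the notebook code has finished executing ...
kernel_b.send_signal(signal.SIGTERM)   # same as kill -15, lets the handlers run
try:
    kernel_b.wait(timeout=30)          # give the audit/cleanup code time to complete
except subprocess.TimeoutExpired:
    kernel_b.kill()                    # fall back to SIGKILL only if it hangs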
I'm running dotMemory command line against an IoT Windows Forms application which requires many hours of tests on a custom appliance.
My purpose is to get memory snapshots on a time basis, while the application is running on the appliance. For example, if the test is designed to run for 24h, I want to get a 10 seconds memory snapshot each hour.
I found 2 ways of doing it:
Run dotMemory.exe and get a standalone snapshot on a time basis, by using schtasks to schedule each execution;
Run dotMemory using the attach and trigger arguments and get all the snapshots on a single file.
The first scenario is already working for me, but as is easy to see, the second one is much better for further analysis after collecting the data.
I'm able to start it by using a command just like:
C:\dotMemory\dotMemory.exe attach $processId --trigger-on-activation --trigger-timer=10s --trigger-max-snapshots=24 --trigger-delay=3600s --save-to-dir=c:\dotMemory\Snapshots
Here comes my problem:
How can I make the command/process stop after it reaches the max-snapshot value without any human intervention?
Reference: https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html
If you start your app under profiling instead of attaching to an already running process, stopping the profiling session will kill the app under profiling. You can stop the profiling session by passing the ##dotMemory["disconnect"] command to the dotMemory console's stdin (e.g., a script can do that after some time).
See dotmemory help service-messages for details
##dotMemory["disconnect"] Disconnect profiler.
If you started profiling with 'start*' commands, the profiled process will be killed.
If you started profiling with 'attach' command, the profiler will detach from the process.
P.S.
Some notes about your command line: with this command line, dotMemory will take a snapshot every 10 seconds, but will only start doing so after one hour. Also, there is no such thing as a "10 seconds memory snapshot"; a memory snapshot is a momentary snapshot of the object graph in memory. The right command line for your task would be
C:\dotMemory\dotMemory.exe attach $processId --trigger-on-activation --trigger-timer=1h --trigger-max-snapshots=24 --save-to-dir=c:\dotMemory\Snapshots
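For the "some script can do that after some time" part, here is a hedged Python sketch: it launches the attach command from above and then writes the disconnect service message to dotMemory's stdin once the snapshot window should have passed. The process id, paths, and the 24-hour wait are assumptions; with attach, disconnect will detach the profiler rather than kill the app:
import subprocess
import time

process_id = 1234   # hypothetical: pid of the running IoT application
cmd = [r"C:\dotMemory\dotMemory.exe", "attach", str(process_id),
       "--trigger-on-activation", "--trigger-timer=1h",
       "--trigger-max-snapshots=24",
       r"--save-to-dir=c:\dotMemory\Snapshots"]

profiler = subprocess.Popen(cmd, stdin=subprocess.PIPE, text=True)

time.sleep(24 * 60 * 60)   # wait until the 24 hourly snapshots should be done

# ask dotMemory to end the session; see 'dotmemory help service-messages'
profiler.stdin.write('##dotMemory["disconnect"]\n')
profiler.stdin.flush()
profiler.wait()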
I have a test suite that I run with
python3 -mpytest --log-cli-level=DEBUG ...
on the build server. The live logs are useful to troubleshoot if the tests get stuck or are slow for some reason (the tests use external resources).
To speed things up, it is possible to run them with e.g.
python3 -mpytest -n 4 --log-cli-level=DEBUG ...
to have four parallel test runners. Speedup is almost linear with number of processes, which is great, but unfortunately the parent process swallows all live logs. I get the captured logs in case of a test failure, but I need the live logs as well to understand what is going on in real time. I understand that the output from all four parallel runs will be intermixed and that is fine. The purpose is for the committer to just check the build server output and know roughly what is going on.
I am currently using pytest-xdist, but use none of the more advanced features from it (just the multiprocessing).
I have a shell script that queues multiple tasks for execution on an HPC cluster. The same job submission script works for either Torque or Grid Engine with some minor conditional logic. This is a pipeline where the output of earlier tasks is fed to later tasks for further processing. I'm using qsub to define job dependencies, so later tasks wait for earlier tasks to complete before starting execution. So far so good.
Sometimes, a task fails. When a failure happens, I don't want any of the dependent tasks to attempt processing the output of the failed task. However, the dependent tasks have already been queued for execution long before the failure occurred. What is a good way to prevent the unwanted processing?
You can use the afterok dependency argument. For example, the qsub command may look like:
qsub -W depend=afterok:<jobid> submit.pbs
Torque will only start the next job if the jobid exits without errors. See documentation on the Adaptive Computing page.
Here is what I eventually implemented. The key to making this work is returning error code 100 on error. Sun Grid Engine stops execution of subsequent jobs upon seeing error code 100. Torque stops execution of subsequent jobs upon seeing any non-zero error code.
qsub starts a sequence of bash scripts. Each of those bash scripts has this code:
handleTrappedErrors()
{
    errorCode=$?
    bashCommand="$BASH_COMMAND"
    scriptName=$(basename "$0")
    lineNumber=${BASH_LINENO[0]}
    # log an error message to a log file here -- not shown
    exit 100    # 100 stops dependent jobs on both SGE and Torque
}

trap handleTrappedErrors ERR
Torque (as Derek mentioned):
qsub -W depend=afterok:<jobid> ...
Sun Grid Engine:
qsub -hold_jid <jobid> ...
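As a rough illustration of how the pieces fit together, the pipeline driver can capture each job id at submission time and pass it to the next qsub call with the appropriate flag. The sketch below is Python rather than the original shell script, and the job-id parsing is an assumption that will likely need adapting to your scheduler's exact output:
import subprocess

def submit(script, prev_job=None, scheduler="torque"):
    # Build the qsub command, holding this job on the success of prev_job.
    cmd = ["qsub"]
    if prev_job is not None:
        if scheduler == "torque":
            cmd += ["-W", "depend=afterok:" + prev_job]
        else:  # Sun Grid Engine
            cmd += ["-hold_jid", prev_job]
    cmd.append(script)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Torque prints something like "12345.server"; SGE prints "Your job 12345 ...".
    # Grab the first token that starts with a digit (assumption).
    return next(tok for tok in out.split() if tok[:1].isdigit())

# Chain three hypothetical pipeline stages.
job1 = submit("step1.sh")
job2 = submit("step2.sh", prev_job=job1)
job3 = submit("step3.sh", prev_job=job2)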
I want to use some parallel features in Matlab, so I execute the following commands:
matlabpool open local 12;
batch(funcname,1,{arg},'PathDependencies',p,'Matlabpool',1);
Then all processes stay silent for the rest of the time...
But without opening matlabpool, it finishes normally.
Are there any conflicts between the use of matlabpool and batch?
The matlabpool command runs a parallel job on the local scheduler to give you workers on which to run the body of your parfor loops and spmd blocks. This means that while matlabpool is open, the number of workers available to the local scheduler is reduced. Then, when you try to run a batch job, it can only run when there are workers free.
You can find out how many running jobs you have on your local scheduler either using the "job monitor" from the "Parallel" desktop menu item (your matlabpool session would show up there as a job in state running with 12 tasks), or by executing the following piece of code:
s = findResource( 'scheduler', 'Type', 'local' );
[pending, queued, running, finished] = findJob(s);
running
If you want to use batch and parfor at the same time, open one fewer worker with matlabpool than you otherwise would, so 11 in your case. If you call batch first and then matlabpool, it will do this automatically, but not vice versa.
To see the queue:
c=parcluster
c.Jobs
Interestingly, if you open up a second MATLAB instance, you can get another 12 workers, but strangely not with a third. That makes sense, I guess: if you actually used them all, the machine would thrash.