Multiple GPU code on MATLAB runs for a few seconds only

I am running the following MATLAB code on a system with one GTX 1080 and one K80 (which has 2 GPUs):
delete(gcp('nocreate'));
parpool('local',2);
spmd
    gpuDevice(labindex+1)
end
reset(gpuDevice(2))
reset(gpuDevice(3))
parfor i=1:100
    SingleGPUMatlabCode(i);
end
The code runs for around a second. When I rerun the code after a few seconds, I get the message:
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The
CUDA error was:
unknown error
Error in CreateDictionary
reset(gpuDevice(2))
I tried increasing TdrDelay, but it did not help.

Something in your GPU code is causing an error on the device. Because the code is running asynchronously, this error is not picked up until the next synchronisation point, which is when you run the code again. I would need to see the contents of SingleGPUMatlabCode to know what that error might be. Perhaps there's an allocation failure or an out of bounds access. Errors that aren't correctly handled will get converted to 'unknown error' at the next CUDA operation.
Try adding wait(gpuDevice) inside the loop to identify when the error is occurring.
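The suggestion above could look roughly like the following sketch. It assumes the pool and device selection from the question are already set up. The try/catch and the printed message are additions for illustration, not part of the original code. Synchronising after every iteration slows the loop down, so this is meant for diagnosis only.

```matlab
% Diagnostic sketch: synchronise after each iteration so that an
% asynchronous CUDA error surfaces in the iteration that caused it.
parfor i = 1:100
    try
        SingleGPUMatlabCode(i);   % the asker's function; its contents are unknown here
        wait(gpuDevice);          % block until the worker's current GPU has finished all queued work
    catch err
        % Report which iteration failed instead of a later 'unknown error'
        fprintf('Iteration %d failed: %s\n', i, err.message);
    end
end
```

If the error is reported for a specific value of i, running SingleGPUMatlabCode(i) on its own outside the parfor loop is a reasonable next step.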
If either device 2 or device 3 is the GTX 1080, you may have discovered an issue with MATLAB's restricted support for the Pascal architecture. See https://www.mathworks.com/matlabcentral/answers/309235-can-i-use-my-nvidia-pascal-architecture-gpu-with-matlab-for-gpu-computing
If this is caused by the Windows timeout, you would see a several second screen blackout.

Related

MetalKit for iOS 10 : Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF Code 2)

Using MetalKit for iOS 10, we try to perform an MPSCNNConvolution with the following inputs:
Kernel Size : 16x16
Input channels : 300
Output channels : 250
Dimensions of input image : 250x250x300
Execution of the command buffer takes over 10 seconds, after which it exits with "Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF Code 2)". How can we fix this?
Is there a way to speed up the process? (10 seconds is too long for executing these high-dimensional convolutions.)
Our aim is to use these convolutions to implement deconvolution ourselves, since there is no API for it yet. Are there any API methods that perform deconvolution?
It sounds like there was an error that led to a timeout. I don't think the execution time of your program is the actual cause of the timeout.
I would try the following: go to Product -> Scheme -> Edit Scheme -> with "Run" selected hit the Options tab -> Set Metal API Validation to Enabled.
That will allow Metal to throw an exception the moment you pass it invalid parameters, rather than spitting out mysterious errors later on.

mxDestroyArray double free or corruption

I am running Matlab from a Fortran function and keep getting the error
*** glibc detected *** /matlab/8.5/bin/glnxa64/MATLAB: double free or corruption (out): 0x00002b11a9a86f20 ***
I am not sure which line the error is occurring on, but I have quite a few that follow this pattern
MLVar = engGetVariable(ep, 'un')
call mxCopyPtrToReal8(mxGetPr(MLVar), SurfaceField, BoundaryCells)
call mxDestroyArray(MLVar)
and I go through this function between 1 and 100s of times before this error occurs.
It looks like here they said to use mxDestroyArray which I'm already using.
Any advice?
The problem ended up being completely unrelated. I am submitting this to a remote cluster using a submission script with a "V" option, and when I closed my terminal connection, Matlab was forced to close.

c++ amp matrixmultiplication accelerator_view_removed at memory location

I am playing with the matrixmultiplication project downloadable from the bottom of the site:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/02/matrix-multiplication-sample.aspx
When I change the values of M, N, W from 256 to 4096, an unhandled exception is thrown:
Unhandled exception at 0x7630C42D in MatrixMultiplication.exe: Microsoft C++ exception: Concurrency::accelerator_view_removed at memory location 0x001CE2F0.
The console output is:
Using device: NVIDIA GeForce GT 640M
MatrixDiemnsion C(4096x4096) = A(4096x4096) * B(4096x4096)
CPU(single core) exec completed.
AMP Simple
The next statement to be executed is leaving the function mxm_amp_simple.
I am using VS2013 Ultimate on Windows 7 Professional N.
Why does this occur and how to prevent this from happening?
EDIT: I have found that the greatest value for M,N,W with which AMP Simple does not lead to a breakpoint being hit is 2800 (M=2800, N=2800, W=2800).
AMP Tiled on the other hand sometimes leads to a breakpoint, and in other cases executes correctly for M,N,W equal to 4096.
The exception is accompanied by a system error message:
"Display driver stopped responding and has recovered. Display driver NVIDIA Windows Kernel Mode Driver, Version 331.65 stopped responding and has successfully recovered."
In case someone else needs this.
This issue is most likely caused by Timeout Detection and Recovery (TDR). If a kernel runs for more than 2 seconds, Windows will kill it and a Concurrency::accelerator_view_removed exception is thrown. The easiest way to check this is to wrap the code in a try / catch block. E.g.
try {
    av_c.synchronize();
} catch (const Concurrency::accelerator_view_removed& e) {
    printf("%s\n", e.what());
}
Microsoft has a blog post with more information, including pointers to instructions on how to disable it.

ram keeps on filling when program is abruptly closed in matlab

MATLAB keeps acquiring images from the video object when the program is closed abruptly. Is there any way to know whether my program has been stopped abruptly?
It only stops when I type stop(vid), and only if the vid object is still in the workspace.
If you have cleared the vid object with clear all, MATLAB keeps acquiring images from the camera.
I think what you mean is this: You have some MATLAB code that acquires data from a camera. If your code exits before getting to the point at which you order the camera to stop acquiring data, then the camera keeps going until your RAM is full. Please correct me if I'm wrong. If I'm right, I'd advise you to use try/catch statements as follows:
start(vid);
try
    %some code to use the video data
    stop(vid);
catch
    stop(vid);
end
Everything inside 'try' will run, and when it has finished, the video capture will be terminated. If something goes wrong and throws an error before it has got as far as the stop(vid); command, then rather than stalling the program it will display the error in the matlab prompt, and then jump to execute any code inside 'catch'. This means that if your code ends suddenly, the stop(vid); command is still run, and you don't run out of RAM.
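As an alternative to the try/catch pattern above (a different technique, not the one this answer proposes), MATLAB's onCleanup can tie the stop call to the lifetime of a function. The sketch below assumes the acquisition code lives in a function and that vid is an existing video input object. The function name acquireFrames is hypothetical.

```matlab
function acquireFrames(vid)
% acquireFrames  Hypothetical wrapper around the acquisition code.
%   vid is assumed to be an existing video input object.

    start(vid);

    % The cleanup function runs when stopper is destroyed, i.e. when this
    % function exits: normally, because of an error, or after Ctrl+C.
    stopper = onCleanup(@() stop(vid));

    % some code to use the video data
end
```

Unlike the catch branch, this also covers an interruption with Ctrl+C, which does not raise a catchable error.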

Perl tk main window error

I have a Perl Tk application.
If I move the main window so that it's not right up to the uppermost part of the screen, then the next time the following code is executed, the script fails:
$canvas_fimage_real=$canvas_fimage->Subwidget('canvas');
$canvas_fimage_real=$canvas_fimage unless $canvas_fimage_real;
my $canvas_id=$canvas_fimage_real->id;
my $canvas_fimage_photo=$main_window::main_window->Photo(-format=>'Window', -data=>oct $canvas_id );
And it fails with the following error message:
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 73 (X_GetImage)
Serial number of failed request: 2796
Current serial number in output stream: 2796
The script crashes at the Photo command.
How can I fix this?
Is this a window that is wholly on the screen? The snapshotting facility only works with what is visible on-screen (a low-level X11 condition; not negotiable). As such, you should file a bug report as the snapshot code shouldn't ask for things that it can't get.
Of course, if the window is fully on screen and you're getting that error message anyway, that's a serious problem. File a bug report in that case too!