MetalKit for iOS 10 : Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF Code 2) - iphone

Using MetalKit for iOS 10, we are trying to run MPSCNNConvolution with the following inputs:
Kernel Size : 16x16
Input channels : 300
Output channels : 250
Dimensions of input image : 250x250x300
Execution of the command buffer takes over 10 seconds and then aborts with "Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF Code 2)". How can we fix this?
Is there a way to speed up the process? (10 seconds is too long for these high-dimensional convolutions.)
Our aim is to implement deconvolution with these convolutions; since there is no API for it yet, we are trying to do it ourselves. Are there any API methods that perform deconvolution?

It sounds like there was an error that led to a timeout. I don't think the execution time of your program is the actual cause of the timeout.
I would try the following: go to Product -> Scheme -> Edit Scheme -> with "Run" selected hit the Options tab -> Set Metal API Validation to Enabled.
That will allow Metal to throw an exception the moment you pass it invalid parameters, rather than spitting out mysterious errors later on.
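For a sense of scale on the speed question, here is a rough count of multiply-accumulate operations for the layer described in the question. It is only a sketch: it assumes a direct (naive) convolution with stride 1 and an output of the same 250x250 spatial size as the input, and `conv_macs` is a hypothetical helper, not part of Metal or MPS.

```cpp
#include <cstdint>

// Multiply-accumulate count for a direct 2D convolution.
// Assumption: stride 1, one MAC per (output pixel, output channel,
// kernel tap, input channel) combination.
constexpr std::uint64_t conv_macs(std::uint64_t out_w, std::uint64_t out_h,
                                  std::uint64_t out_channels,
                                  std::uint64_t kernel_w, std::uint64_t kernel_h,
                                  std::uint64_t in_channels) {
    return out_w * out_h * out_channels * kernel_w * kernel_h * in_channels;
}

// Numbers from the question: 16x16 kernel, 300 input channels,
// 250 output channels, 250x250 image (output size assumed equal to input).
constexpr std::uint64_t kLayerMacs = conv_macs(250, 250, 250, 16, 16, 300);
static_assert(kLayerMacs == 1200000000000ULL, "1.2e12 multiply-accumulates");
```

Without padding the output would be 235x235 and the count drops to about 1.06e12, so under either assumption a single pass is on the order of a trillion multiply-accumulates. That does not by itself say whether the abort comes from run time or from invalid parameters, which is what enabling API validation would help rule out.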

Related

is there a way to disable a mega-detailed error message in Rundeck's failed executions?

The message that is posted every time an execution fails is too verbose and creates too much noise. Is there a way to disable or hide it somehow? We have our own error messages within the scripts and don't need a red chunk of extra text displayed in the logs. Adding an error handler that will exit with code 0 is not an option because we still need the job to fail if a step fails.
That message is Rundeck's standard output when an execution fails. You can change the log level, but that only affects the text from your commands and scripts, not the final NonZeroResultCode line. You can suggest that here. The rest are just workarounds, as you mentioned.
This occurs even using the Quiet output.

coreML Error computing NN outputs -1, IOAF code -536870211

I am trying to verify my "port" of Google's DeepLabV3 to Core ML. I converted the model (excluding the final bilinear resize) using tfcoreml. My test app crashes on the device at:
let options = MLPredictionOptions()
options.usesCPUOnly = false
let outFeatures = try! self.segmentModel.prediction(from: input, options: options)
with the following error:
2018-07-10 19:25:02.017634-0500 VisionDetection[5608:858692] Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code -536870211)
2018-07-10 19:25:02.351176-0500 VisionDetection[5608:858644] [coreml] Error computing NN outputs -1
I've been unable to find any information on IOAF code -536870211.
It does run (albeit very slowly) if I set usesCPUOnly = true, but I need the GPU to get the speed up to a usable level.
Any information would be appreciated. I have an iPhone 6 running iOS 11.3, and Xcode 9.4.
Update:
I've since found information on IOAF code -536870211: it indicates a memory allocation problem. I guess this model might just not fit on the GPU?
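One way to look up a code like this is to reinterpret the signed value as an unsigned 32-bit pattern and print it in hex. The sketch below does that and splits the result using the Mach error layout (6-bit system, 12-bit subsystem, 14-bit code). The helper names are hypothetical, and the mapping to `kIOReturnNoMemory` is my reading of IOKit's `IOReturn.h`, which would be consistent with the memory allocation explanation in the update.

```cpp
#include <cstdint>

// Reinterpret a signed 32-bit status code as its unsigned bit pattern.
constexpr std::uint32_t as_unsigned(std::int32_t code) {
    return static_cast<std::uint32_t>(code);
}

// Field extraction following the Mach error layout (assumption:
// system = top 6 bits, subsystem = next 12 bits, code = low 14 bits).
constexpr std::uint32_t err_system_field(std::uint32_t e) { return (e >> 26) & 0x3Fu; }
constexpr std::uint32_t err_sub_field(std::uint32_t e)    { return (e >> 14) & 0xFFFu; }
constexpr std::uint32_t err_code_field(std::uint32_t e)   { return e & 0x3FFFu; }

// The value from the log above.
constexpr std::uint32_t kIoafCode = as_unsigned(-536870211);

static_assert(kIoafCode == 0xE00002BDu, "hex form of the logged code");
static_assert(err_system_field(kIoafCode) == 0x38u, "IOKit system id");
static_assert(err_sub_field(kIoafCode) == 0u, "common IOKit subsystem");
static_assert(err_code_field(kIoafCode) == 0x2BDu, "0x2bd, listed as kIOReturnNoMemory");
```

Searching for the hex form 0xE00002BD tends to be more productive than searching for the signed decimal value that appears in the log.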

Multiple GPU code on Matlab runs for few seconds only

I am running the following MATLAB code on a system with one GTX 1080 and a K80 (with 2 GPUs)
delete(gcp('nocreate'));
parpool('local',2);
spmd
gpuDevice(labindex+1)
end
reset(gpuDevice(2))
reset(gpuDevice(3))
parfor i=1:100
SingleGPUMatlabCode(i);
end
The code runs for around a second. When I rerun it a few seconds later, I get the message:
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The
CUDA error was:
unknown error
Error in CreateDictionary
reset(gpuDevice(2))
I tried increasing TdrDelay, but it did not help.
Something in your GPU code is causing an error on the device. Because the code is running asynchronously, this error is not picked up until the next synchronisation point, which is when you run the code again. I would need to see the contents of SingleGPUMatlabCode to know what that error might be. Perhaps there's an allocation failure or an out of bounds access. Errors that aren't correctly handled will get converted to 'unknown error' at the next CUDA operation.
Try adding wait(gpuDevice) inside the loop to identify when the error is occurring.
If either device 2 or 3 is the GTX 1080, you may have discovered an issue with MATLAB's restricted support for the Pascal architecture. See https://www.mathworks.com/matlabcentral/answers/309235-can-i-use-my-nvidia-pascal-architecture-gpu-with-matlab-for-gpu-computing
If this is caused by the Windows timeout, you would see a several second screen blackout.

c++ amp matrixmultiplication accelerator_view_removed at memory location

I am playing with the matrixmultiplication project downloadable from the bottom of the site:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/02/matrix-multiplication-sample.aspx
When I change the values of M, N, W from 256 to 4096, an unhandled exception is thrown:
Unhandled exception at 0x7630C42D in MatrixMultiplication.exe: Microsoft C++ exception: Concurrency::accelerator_view_removed at memory location 0x001CE2F0.
The console output is:
Using device: NVIDIA GeForce GT 640M
MatrixDiemnsion C(4096x4096) = A(4096x4096) * B(4096x4096)
CPU(single core) exec completed.
AMP Simple
The next statement to be executed is leaving the function mxm_amp_simple.
I am using VS2013 Ultimate on Windows 7 Professional N.
Why does this occur, and how can I prevent it?
EDIT: I have found that the greatest value for M,N,W with which AMP Simple does not lead to a breakpoint being hit is 2800 (M=2800, N=2800, W=2800).
AMP Tiled on the other hand sometimes leads to a breakpoint, and in other cases executes correctly for M,N,W equal to 4096.
The exception is accompanied by a system error message:
"Display driver stopped responding and has recovered. Display driver NVIDIA Windows Kernel Mode Driver, Version 331.65 stopped responding and has successfully recovered."
In case someone else needs this.
This issue is most likely caused by Timeout Detection and Recovery (TDR). If a kernel runs for more than 2 seconds, Windows will kill it and a Concurrency::accelerator_view_removed exception is thrown. The easiest way to check this is to wrap the code in a try/catch block, e.g.:
try {
    av_c.synchronize();
} catch (const Concurrency::accelerator_view_removed& e) {
    printf("%s\n", e.what());
}
Microsoft has a blog post with more information, including pointers to instructions on how to disable it.
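A rough work estimate lines up with a time-based limit such as TDR. The sketch below counts multiply-adds for a naive matrix product at the two sizes from the question's edit (2800 works, 4096 fails). `matmul_macs` is a hypothetical helper, and the assumption that run time scales with this count is a simplification that ignores memory behaviour and tiling.

```cpp
#include <cstdint>

// Multiply-add count for a naive C(MxN) = A(MxW) * B(WxN).
constexpr std::uint64_t matmul_macs(std::uint64_t m, std::uint64_t n, std::uint64_t w) {
    return m * n * w;
}

constexpr std::uint64_t kWorking = matmul_macs(2800, 2800, 2800);  // largest size reported to work
constexpr std::uint64_t kFailing = matmul_macs(4096, 4096, 4096);  // size that raises the exception

static_assert(kWorking == 21952000000ULL, "2800^3");
static_assert(kFailing == 68719476736ULL, "4096^3 = 2^36");

// Ratio of work between the failing and the working size (a little over 3x).
constexpr double kWorkRatio = static_cast<double>(kFailing) / static_cast<double>(kWorking);
static_assert(kWorkRatio > 3.13 && kWorkRatio < 3.14, "about 3.13x more work");
```

If the simple kernel sits just under the limit at 2800, roughly three times the work at 4096 would push it well past the limit, which fits the "Display driver stopped responding" message.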

Perl tk main window error

I have a Perl Tk application.
If I move the main window so that it's not right up to the uppermost part of the screen, then the next time the following code is executed, the script fails:
$canvas_fimage_real=$canvas_fimage->Subwidget('canvas');
$canvas_fimage_real=$canvas_fimage unless $canvas_fimage_real;
my $canvas_id=$canvas_fimage_real->id;
my $canvas_fimage_photo=$main_window::main_window->Photo(-format=>'Window', -data=>oct $canvas_id );
And it fails with the following error message:
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 73 (X_GetImage)
Serial number of failed request: 2796
Current serial number in output stream: 2796
The script crashes at the Photo command.
How can I fix this?
Is this a window that is wholly on the screen? The snapshotting facility only works with what is visible on-screen (a low-level X11 condition; not negotiable). As such, you should file a bug report as the snapshot code shouldn't ask for things that it can't get.
Of course, if the window is fully on screen and you're getting that error message anyway, that's a serious problem. File a bug report in that case too!