I am training a feed-forward network on a GPU using Chainer. After training for some batches, I get the error "CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered".
Only 1.5 GB of the GPU's 11 GB of memory is in use.
Once the error is encountered, an attempt to create even a small CuPy array fails with the same error.
Chainer: v6.5.0
CuPy: v6.5.0
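Since cudaErrorIllegalAddress is "sticky" (every CUDA call after the faulty kernel reports it, which is why even a small CuPy allocation fails), one way to localize it is to force synchronous kernel launches so the error surfaces at the kernel that actually caused it. A minimal debugging sketch, assuming the environment variable is set before the first CUDA call:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # CUDA runtime option: launch kernels synchronously
import cupy as cp
# ... run the training loop as usual; with blocking launches, the illegal
# access is reported by the kernel that caused it, not by a later allocation.
cp.cuda.Stream.null.synchronize()  # explicit sync points can narrow it down further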
Related
I trained a model on DenseNet-161 and saved it:
torch.save(model_ft.state_dict(),'/content/drive/My Drive/Stanford40/densenet161.pth')
followed by this:
model = models.densenet161(pretrained=False,num_classes=11)
model_ft.classifier=nn.Linear(2208,11)
model.load_state_dict(torch.load('/content/drive/My Drive/Stanford40/densenet161.pth'))
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model=model.to(device)
Then I want to proceed to model evaluation:
test_tuple=datasets.ImageFolder('/content/drive/My Drive/Stanford40/body/valid',transform=data_transforms['valid'])
test_dataloader=torch.utils.data.DataLoader(test_tuple,batch_size=1,shuffle=True)
class_names=test_tuple.classes
i=0
length=dataset_sizes['valid']
y_true=torch.zeros([length,1])
y_pred=torch.zeros([length,1])
for inputs, labels in test_dataloader:
    model_ft.eval()
    inputs = inputs.to(device)
    outputs = model_ft(inputs)
    y_true[i][0] = labels
    maxvalues, indices = torch.max(outputs, 1)
    y_pred[i][0] = indices
    i = i + 1
and I face the runtime error shown in the picture.
When I check whether my model was moved to the device with this code:
next(model.parameters()).is_cuda
the result is True.
How can I modify the code to avoid this error?
My model training code can be found at "How to solve TypeError: can’t convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first".
You have moved model to the GPU, but not model_ft. The runtime error is at outputs = model_ft(inputs). Could it be a case of mixed-up variable names?
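For reference, a minimal corrected sketch of the evaluation loop, assuming the freshly loaded model (the one already moved to device) is the network you want to evaluate:

model.eval()  # set eval mode once, outside the loop
with torch.no_grad():  # gradients are not needed for evaluation
    i = 0
    for inputs, labels in test_dataloader:
        inputs = inputs.to(device)
        outputs = model(inputs)  # use `model`, which is on the device
        _, indices = torch.max(outputs, 1)
        y_true[i][0] = labels
        y_pred[i][0] = indices.item()  # copy the GPU result back to the host tensor
        i = i + 1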
I am trying to verify my "port" of Google's DeepLabV3 to Core ML. I converted the model (excluding the final bilinear resize) using tfcoreml. My test app crashes on-device at:
let options = MLPredictionOptions()
options.usesCPUOnly = false
let outFeatures = try! self.segmentModel.prediction(from: input, options: options)
with the following error:
2018-07-10 19:25:02.017634-0500 VisionDetection[5608:858692] Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code -536870211)
2018-07-10 19:25:02.351176-0500 VisionDetection[5608:858644] [coreml] Error computing NN outputs -1
I've been unable to find any information on IOAF code -536870211.
It does run (albeit very slowly) if I set usesCPUOnly = true, but I will need the GPU to get the speed up to a usable level.
Any information would be appreciated. I have an iPhone 6 running iOS 11.3 and Xcode 9.4.
Update:
I've since found information on IOAF code -536870211: it is a memory allocation problem. I guess this model might just not fit on the GPU?
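If it is an allocation problem, one workaround is to stop crashing on the GPU failure and fall back to the CPU path instead. A hedged sketch, assuming the same segmentModel and input as above:

let options = MLPredictionOptions()
options.usesCPUOnly = false
do {
    let outFeatures = try segmentModel.prediction(from: input, options: options)
    // ... use outFeatures ...
} catch {
    // The GPU path aborted (e.g. the IOAF allocation error); retry on the CPU.
    options.usesCPUOnly = true
    let outFeatures = try? segmentModel.prediction(from: input, options: options)
    // ... use outFeatures: slower, but robust ...
}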
Lately I have been getting a lot of HAL_UART_ERROR_FE (frame error) results. I have found nothing about what causes this error in the first place. Can someone explain what is going wrong when I get this error?
A framing error means the receiver sampled a low level where it expected a stop bit. It can be caused by:
Mismatched bitrate
Noise on the line
Starting the receiver while the other endpoint is already transmitting
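To confirm that a frame error is really what you are hitting, and to recover from it, you can inspect the error code in the HAL error callback. A minimal sketch, assuming an STM32 HAL project with interrupt-driven reception; the device header, handle, and buffer size are placeholders for your own project:

#include "stm32f4xx_hal.h"  /* adjust to your device family */

static uint8_t rx_buf[64];

void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart)
{
    if (HAL_UART_GetError(huart) & HAL_UART_ERROR_FE)
    {
        /* Frame error: a low level was sampled where a stop bit was expected. */
        __HAL_UART_CLEAR_FEFLAG(huart);
        /* Reception is aborted when an error occurs, so re-arm it. */
        HAL_UART_Receive_IT(huart, rx_buf, sizeof rx_buf);
    }
}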
Using MetalKit on iOS 10, we try to perform an MPSCNNConvolution with the following inputs:
Kernel size: 16x16
Input channels: 300
Output channels: 250
Input image dimensions: 250x250x300
Execution of the command buffer takes over 10 seconds, and after that it exits with "Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF Code 2)". How can this be fixed?
Is there a way to speed up the process? (10 seconds is far too long for executing these high-dimensional convolutions.)
Our aim with these convolutions is to implement deconvolution, and since there is no API for it yet, we are trying to build it ourselves. Are there any API methods that perform these deconvolution operations?
It sounds like there was an error that led to a timeout. I don't think the execution time of your program is the actual cause of the timeout.
I would try the following: go to Product -> Scheme -> Edit Scheme -> with "Run" selected hit the Options tab -> Set Metal API Validation to Enabled.
That will allow Metal to throw an exception the moment you pass it invalid parameters, rather than spitting out mysterious errors later on.
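Independently of validation, you can also read the command buffer's own error after execution rather than relying on the console message alone. A small sketch, assuming an existing MTLCommandBuffer named commandBuffer:

commandBuffer.addCompletedHandler { buffer in
    // buffer.error is set when execution was aborted, e.g. by the IOAF timeout
    if let error = buffer.error {
        print("Command buffer failed: \(error)")
    }
}
commandBuffer.commit()
commandBuffer.waitUntilCompleted()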
I am running the following MATLAB code on a system with one GTX 1080 and one K80 (which has 2 GPUs):
delete(gcp('nocreate'));
parpool('local',2);
spmd
    gpuDevice(labindex+1)
end
reset(gpuDevice(2))
reset(gpuDevice(3))
parfor i=1:100
    SingleGPUMatlabCode(i);
end
The code runs for around a second. When I rerun it after a few seconds, I get this message:
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The
CUDA error was:
unknown error
Error in CreateDictionary
reset(gpuDevice(2))
I tried increasing TdrDelay, but it did not help.
Something in your GPU code is causing an error on the device. Because the code is running asynchronously, this error is not picked up until the next synchronisation point, which is when you run the code again. I would need to see the contents of SingleGPUMatlabCode to know what that error might be. Perhaps there's an allocation failure or an out of bounds access. Errors that aren't correctly handled will get converted to 'unknown error' at the next CUDA operation.
Try adding wait(gpuDevice) inside the loop to identify when the error is occurring.
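A minimal sketch of that suggestion, reusing the parfor loop from the question:

parfor i=1:100
    SingleGPUMatlabCode(i);
    wait(gpuDevice);  % synchronise here so the failing iteration reports its own error
end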
If either device 2 or 3 are the GTX1080, you may have discovered an issue with MATLAB's restricted support for the Pascal architecture. See https://www.mathworks.com/matlabcentral/answers/309235-can-i-use-my-nvidia-pascal-architecture-gpu-with-matlab-for-gpu-computing
If this is caused by the Windows timeout, you would see a several-second screen blackout.