Crash on the GPU with {inc,set}_subtensor and broadcasting the value

I am fine-tuning a VGG16 network with Keras 2.0.2 and Theano 0.9.0 as the backend on Windows 10 64-bit with Anaconda 2, following this blog post: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
I found that someone else reported the same issue in a pull request, and it was fixed by changing a few lines of code (link: https://github.com/Theano/Theano/pull/2075). However, that was for an old version of Theano (the PR is from 2014). Theano 0.9.0 has already changed that code, and I still have this problem.
Every time I run the last line (i.e. model.fit_generator), everything works fine until the end of the first epoch. That is exactly when the GPU always crashes:
model.fit_generator(
    train_generator,
    samples_per_epoch=2000,
    nb_epoch=50,
    validation_data=validation_generator,
    nb_val_samples=400)
And here is the error message:
CudaNdarray_CopyFromCudaNdarray: need same dimensions for dim 0, destination=32, source=16
Apply node that caused the error: GpuIncSubtensor{Set;::, ::, int64:int64:, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{mul,no_inplace}.0, Constant{1}, Constant{225}, Constant{1}, Constant{225})
Toposort index: 143
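One possible reading of the `destination=32, source=16` mismatch, offered as an assumption rather than a confirmed diagnosis: if the training set holds 2000 images and the generator uses a batch size of 32 (neither is shown in the question), the samples do not divide evenly, so the final batch of each epoch holds only 16 samples. That would line up with the crash arriving at the end of the first epoch. The sketch below only does the arithmetic; `batch_sizes` is a hypothetical helper, not part of Keras or Theano.

```python
def batch_sizes(n_samples, batch_size):
    """Return the size of every batch needed to cover n_samples once."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)  # short final batch
    return sizes


# Assumed batch size of 32; the 2000 samples come from the question.
sizes = batch_sizes(2000, 32)
print("batches per epoch:", len(sizes))
print("size of last batch:", sizes[-1])
```

If that assumption holds, a sample count that is a multiple of the batch size would avoid the short batch. Whether that removes the crash has not been tested here.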

VS Code showing "Error: Session cannot generate requests" after every use of CatBoost with GPU

I have been trying to use my Nvidia GeForce GTX 1650 GPU to train a CatBoost regressor.
Training works well, but once it finishes it kills the kernel and VS Code needs to be restarted.
Here is the code:
import pandas as pd
import numpy as np
df = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
from catboost import CatBoostRegressor
cat = CatBoostRegressor(iterations=2000,learning_rate=0.061582,task_type='GPU')
cat.fit(df.drop('loss',axis = 1),df.loss)
This runs fine, but every time I try to run the next cell it shows this error:
Error: Session cannot generate requests
Error: Session cannot generate requests
at w.executeCodeCell (c:\Users\singh\.vscode\extensions\ms-toolsai.jupyter-2021.8.1236758218\out\client\extension.js:90:327199)
at w.execute (c:\Users\singh\.vscode\extensions\ms-toolsai.jupyter-2021.8.1236758218\out\client\extension.js:90:326520)
at w.start (c:\Users\singh\.vscode\extensions\ms-toolsai.jupyter-2021.8.1236758218\out\client\extension.js:90:322336)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async t.CellExecutionQueue.executeQueuedCells (c:\Users\singh\.vscode\extensions\ms-toolsai.jupyter-2021.8.1236758218\out\client\extension.js:90:336863)
at async t.CellExecutionQueue.start (c:\Users\singh\.vscode\extensions\ms-toolsai.jupyter-2021.8.1236758218\out\client\extension.js:90:336403)
I have updated all my packages using pip-review and updated the Jupyter extension. xgboost with tree_method = 'gpu_hist' works fine.
Operating System - Windows
Cuda version - 11.2
Nvidia Driver - 462
I had the same issue. I restarted the kernel and VS Code, and that seems to have fixed it.
In my experience, it only means that somewhere in my code there is an "infinite loop". I solved this by restarting VS Code and checking my code for said infinite loop before running it again. I hope this helps.

Nvidia digits on TX2 Error code 1

I am new to Digits and TX2. I am trying to create object detection model using the tutorial from: https://github.com/dusty-nv/jetson-inference
I created the dataset successfully. The issue is with the model.
While creating a model, I am getting the following error.
Memory required for data: 3268934784
creating layer bbox_loss
Creating Layer bbox_loss
bbox_loss <- bboxes-obj-masked-norm
bbox_loss <- bbox-obj-label-norm
bbox_loss -> loss_bbox
Setting up bbox_loss
Top shape: (1)
with loss weight 2
Memory required for data: 3268934788
Creating layer coverage_loss
Creating Layer coverage_loss
coverage_loss <- coverage_coverage/sig_0_split_0
coverage_loss <- coverage-label_slice-label_4_split_0
coverage_loss -> loss_coverage
Setting up coverage_loss
Top shape: (1)
with loss weight 1
Memory required for data: 3268934792
Creating layer cluster
The job directory information on the left is:
Job Directory
/home/nvidia/DIGITS/digits/jobs/20180816-161051-e67a
Disk Size
0 B
Network (train/val)
train_val.prototxt
Network (deploy)
deploy.prototxt
Network (original)
original.prototxt
Solver
solver.prototxt
Raw caffe output
caffe_output.log
Pretrained Model
/home/nvidia/bvlc_googlenet.caffemodel.4
Visualizations
Tensorboard
The error on the server is
2018-08-16 16:10:53 [20180816-161051-e67a] [INFO ] Task subprocess args: "/home/nvidia/Caffe/caffe/build/tools/caffe train --solver=/home/nvidia/DIGITS/digits/jobs/20180816-161051-e67a/solver.prototxt --gpu=0 --weights=/home/nvidia/bvlc_googlenet.caffemodel.4"
2018-08-16 16:11:00 [20180816-161051-e67a] [ERROR] Train Caffe Model task failed with error code 1
I have no idea how to free up memory, as I have more than 2 GB available in the job directory.
Please help me. Thanks in advance.
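A note on the log above, stated as an assumption about how Caffe reports blob sizes: the "Memory required for data" figure appears to be a running total in bytes for the network's data blobs, not disk space in the job directory. A quick conversion sketch (`bytes_to_gib` is a hypothetical helper):

```python
def bytes_to_gib(n_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / float(2 ** 30)


# Last "Memory required for data" value from the log in the question.
print(round(bytes_to_gib(3268934792), 2), "GiB")
```

Under that assumption the network's blobs need roughly 3 GiB, which is a separate budget from the free space in the job directory.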
Had the same issue for the last few days, maybe it will help someone in the future. Firstly, make sure that you have the right version of protobuf. You can check it with:
protoc --version
If it's 2.*, you have to update to 3.*, for example by building it as described here https://github.com/NVIDIA/DIGITS/blob/digits-6.0/docs/BuildProtobuf.md, and then rebuild Caffe. Also, make sure that you have a compatible version of the protobuf pip package. For me, the following version currently works well with DIGITS and Caffe from the tutorial https://github.com/dusty-nv/jetson-inference :
pip install --user --upgrade protobuf==3.1.0.post1
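The version check described above can be scripted. This is a minimal sketch, assuming `protoc --version` prints a line of the form `libprotoc X.Y.Z`; `protoc_major` and `installed_protoc_major` are hypothetical helper names, not part of protobuf, DIGITS, or Caffe.

```python
import re
import subprocess


def protoc_major(version_line):
    """Extract the major version from a line such as 'libprotoc 3.1.0'."""
    match = re.search(r"(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError("unrecognised protoc output: %r" % version_line)
    return int(match.group(1))


def installed_protoc_major():
    """Run `protoc --version` and return its major version (protoc must be on PATH)."""
    out = subprocess.check_output(["protoc", "--version"])
    return protoc_major(out.decode())


if __name__ == "__main__":
    try:
        major = installed_protoc_major()
        if major < 3:
            print("protoc %d.x found: update to 3.x and rebuild Caffe" % major)
        else:
            print("protoc %d.x found" % major)
    except (OSError, subprocess.CalledProcessError):
        print("protoc could not be run; is it installed and on PATH?")
```

The parsing is kept in its own function so it can be checked without protoc installed.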

Can't Get The matlabpool Started

I've been running MATLAB for about a year and a half now, and I've tried to get matlabpool working roughly once every three months. Before I give up on it completely, I've decided to ask for help. :)
My problem starts with the matlabpool command. Whenever I type it in, I get this:
One or more output arguments not assigned during call to "system_dependent".
Error in matlabpool>iIsOnClient (line 73)
onclient = ~system_dependent('isdmlworker');
Error in matlabpool>iVerifyJava (line 64)
if iIsOnClient()
Error in matlabpool (line 10)
iVerifyJava();
After some research and sleepless nights, I found out that this has to be set up through the "Cluster Profile Manager". But I have never had the chance to see that working either. Here is what I get after clicking Cluster Profile Manager in the Parallel panel:
com.mathworks.jmi.MatlabException: Feature isdmlworker not found
at com.mathworks.jmi.NativeMatlab.SendMatlabMessage(Native Method)
at com.mathworks.jmi.NativeMatlab.sendMatlabMessage(NativeMatlab.java:266)
at com.mathworks.jmi.MatlabLooper.sendMatlabMessage(MatlabLooper.java:120)
at com.mathworks.jmi.Matlab.mtFeval(Matlab.java:1710)
at com.mathworks.jmi.MatlabWorker.feval(MatlabWorker.java:197)
at com.mathworks.toolbox.distcomp.ui.profile.model.MatlabProfileManager$1.runOnMatlabThread(MatlabProfileManager.java:80)
at com.mathworks.jmi.MatlabWorker$2.run(MatlabWorker.java:79)
at com.mathworks.jmi.NativeMatlab.dispatchMTRequests(NativeMatlab.java:475)
Attempt to reference field of non-structure array.
Error in parallel.internal.ui.AbstractValidationManager (line 20)
obj.Validator.addlistener('ValidationStarted', ...
Error in parallel.internal.ui.ValidationManager (line 21)
obj@parallel.internal.ui.AbstractValidationManager();
com.mathworks.jmi.MatlabException: Attempt to reference field of non-structure array.
at com.mathworks.jmi.NativeMatlab.SendMatlabMessage(Native Method)
at com.mathworks.jmi.NativeMatlab.sendMatlabMessage(NativeMatlab.java:266)
at com.mathworks.jmi.MatlabLooper.sendMatlabMessage(MatlabLooper.java:120)
at com.mathworks.jmi.Matlab.mtFevalConsoleOutput(Matlab.java:1778)
at com.mathworks.jmi.MatlabWorker.feval(MatlabWorker.java:195)
at com.mathworks.jmi.MatlabWorker.feval(MatlabWorker.java:172)
at com.mathworks.toolbox.distcomp.ui.profile.model.ValidationManager$1.runOnMatlabThread(ValidationManager.java:45)
at com.mathworks.jmi.MatlabWorker$2.run(MatlabWorker.java:79)
at com.mathworks.jmi.NativeMatlab.dispatchMTRequests(NativeMatlab.java:475)
After this message, the Cluster Profile Manager pops up but doesn't show anything besides a "wait" sign. I've checked my Distributed Computing license and that looks fine too. The command
license checkout Distrib_Computing_Toolbox
returns 1.
By the way, there is another error message that I suspect is connected to my problem in some way. I get this on every MATLAB start:
Error using feature
Feature isdmlworker not found
Error in matlabrc (line 187)
if ~(ismcc || isdeployed || feature('isdmlworker')) && usejava('jvm')
In addition to all of that, I get this message whenever I try to open "Parallel Preferences" from the Environment tab:
com.mathworks.jmi.MatlabException: Feature isdmlworker not found
at com.mathworks.jmi.NativeMatlab.SendMatlabMessage(Native Method)
at com.mathworks.jmi.NativeMatlab.sendMatlabMessage(NativeMatlab.java:265)
at com.mathworks.jmi.MatlabLooper.sendMatlabMessage(MatlabLooper.java:120)
at com.mathworks.jmi.Matlab.mtFeval(Matlab.java:1619)
at com.mathworks.jmi.MatlabWorker.feval(MatlabWorker.java:197)
at com.mathworks.toolbox.distcomp.ui.profile.model.MatlabProfileManager$1.runOnMatlabThread(MatlabProfileManager.java:72)
at com.mathworks.jmi.MatlabWorker$2.run(MatlabWorker.java:79)
at com.mathworks.jmi.NativeMatlab.dispatchMTRequests(NativeMatlab.java:440)
I've tried to find the function system_dependent.m, but it doesn't seem to exist. The other common points in the errors I get, the function "feature.m" and the option "isdmlworker", are also mysteries that I couldn't find any information about.
I would really appreciate it if anyone could help me with this problem starting MATLAB's distributed computing system.
Edit: I'm working on an Ubuntu 14.04 and my MATLAB version is R2014a.
This appears to be an issue with your specific installation of Ubuntu 14.04. It is possible though that it relates to how matlabpool spawns worker threads in R2014a given that the error occurs in com.mathworks.jmi.NativeMatlab.dispatchMTRequests().
matlabpool has been tested to work without issue on Ubuntu 15.04 and 15.10. It may not be an ideal solution but upgrading Ubuntu to 15.04 or 15.10 and reinstalling MATLAB R2014a should resolve the issue.

Psychophysics Toolbox Matlab on Ubuntu Installation

I am trying to run code in Matlab that uses the Psychtoolbox and OpenGL. The commands that throw the error described below are:
PsychJavaTrouble
AssertOpenGL
Here are my specs:
OS: Ubuntu 14.04 LTS, 64bit
Processor: Intel Core i5-2450M CPU @ 2.50GHz x 4
Graphics: Intel Sandybridge Mobile
Matlab Version: Matlab 64-Bit (Version 3.0.11 - Build date: Apr 6 2014)
Psychophysics version installed: 3
Installation methodology:
1. sudo apt-get install psychtoolbox in Terminal
2. updated it via UpdatePsychToolbox command in Matlab console
Here is the error message:
PsychJavaTrouble: Will now try to add the PsychJava folder to Matlabs dynamic
classpath...
Warning: "/home/lillian/Desktop/Matlab/Mona_Lisa/Psychtoolbox/PsychJava" is already
specified on static java path.
> In javaclasspath>local_validate_dynamic_path at 285
In javaclasspath>local_javapath at 182
In javaclasspath at 119
In javaaddpath at 71
In PsychJavaTrouble at 86
In ReverseCorrelationFaces at 2
PsychJavaTrouble: Added PsychJava folder to dynamic class path. Psychtoolbox Java
commands should work now!
PTB-INFO: Display ':0' : X-Screen 0 : Assigning primary output as 0 with RandR-CRTC
0 and GPU-CRTC 0.
PTB-INFO: This is Psychtoolbox-3 for GNU/Linux X11, under Matlab 64-Bit (Version
3.0.11 - Build date: Apr 6 2014).
PTB-INFO: No low-level controllable GPU on screenId 0. Beamposition timestamping and
other special functions disabled.
PTB-INFO: Failed to enable realtime-scheduling [Operation not permitted]!
PTB-DEBUG:PsychOSGetSwapCompletionTimestamp: Invalid return values ust = 0, msc = 0
from call with success return code (sbc = 304)! Failing with rc = -2.
PTB-DEBUG:PsychOSGetSwapCompletionTimestamp: This likely means a driver bug or
malfunction, or that timestamping support has been disabled by the user in the
driver!
PTB-INFO: OpenGL-Renderer is Intel Open Source Technology Center :: Mesa DRI
Intel(R) Sandybridge Mobile :: 3.0 Mesa 10.1.3
PTB-INFO: VBL startline = 768 , VBL Endline = -1
PTB-INFO: Will try to use OS-Builtin OpenML sync control support for accurate Flip
timestamping.
PTB-INFO: Measured monitor refresh interval from VBLsync = 16.685075 ms [59.933804
Hz]. (297 valid samples taken, stddev=0.310528 ms.)
PTB-INFO: Reported monitor refresh interval from operating system = 16.646968 ms
[60.070999 Hz].
PTB-INFO: Small deviations between reported values are normal and no reason to
worry.
WARNING: Couldn't compute a reliable estimate of monitor refresh interval! Trouble
with VBL syncing?!?
----- ! PTB - ERROR: SYNCHRONIZATION FAILURE ! ----
One or more internal checks (see Warnings above) indicate that synchronization
of Psychtoolbox to the vertical retrace (VBL) is not working on your setup.
This will seriously impair proper stimulus presentation and stimulus presentation
timing!
Please read 'help SyncTrouble' for information about how to solve or work-around the
problem.
You can force Psychtoolbox to continue, despite the severe problems, by adding the
command
Screen('Preference', 'SkipSyncTests', 1); at the top of your script, if you really
know what you are doing.
Error using Screen
See error message printed above.
Error in ReverseCorrelationFaces (line 81)
window=Screen('OpenWindow', windowNum);
What am I missing? A package? Is my hardware not okay? I can't figure this error out.
So... buried deep inside the DownloadPsychtoolbox.m file found here (see installation instructions here) is the note that Psychtoolbox apparently requires a special SDK. Super annoying. I will never use this toolbox again because it's so much drama to use. But this is what was missing and what caused the Screen call to fail.
Missing SDK download link:
http://docs.gstreamer.com/display/GstSDK/Installing+on+Windows

GTK applications fail to start - xfs restart needed

Sorry, this is not really a programming question, but I am not sure where else I could find some help.
After a recent update (Xorg was updated, among other things), GTK apps stopped running in my KDE 4. I run Debian unstable, last updated around 22 April. When I try to run them I get the following error:
ga@grzes:~$ iceweasel
The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadName (named color or font does not exist)'.
(Details: serial 888 error_code 15 request_code 45 minor_code 0)
(Note to programmers: normally, X errors are reported asynchronously;
that is, you will receive the error a while after causing it.
To debug your program, run it with the --sync command line
option to change this behavior. You can then get a meaningful
backtrace from your debugger if you break on the gdk_x_error() function.)
ga@grzes:~$ gimp
The program 'gimp' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadName (named color or font does not exist)'.
(Details: serial 6955 error_code 15 request_code 45 minor_code 0)
(Note to programmers: normally, X errors are reported asynchronously;
that is, you will receive the error a while after causing it.
To debug your program, run it with the --sync command line
option to change this behavior. You can then get a meaningful
backtrace from your debugger if you break on the gdk_x_error() function.)
(script-fu:4643): LibGimpBase-WARNING **: script-fu: gimp_wire_read(): error
I have to restart the font server manually to have it fixed:
ga@grzes:~$ su
Password:
grzes:/home/ga# /etc/init.d/xfs restart
Stopping X font server: xfs.
Setting up X font server socket directory /tmp/.font-unix...done.
Starting X font server: xfs.
Any ideas what could be wrong? Is it a configuration issue? My system has been continuously updated for the last 7 years, so I may have some old settings.
Debian unstable is very... unstable right now, since a release was made a short time ago. Major changes and package migrations are happening, and Xorg (and all X-related stuff) is one of the critical packages in that process. My advice is to perform a new update/upgrade in order to pick up a newer version that may resolve this problem.
It is very common for something to break in inexplicable ways after an update, simply because the developers are uploading new and lightly tested versions of the applications.
I finally figured this out: it seems xfs is currently not compatible with the other components, and luckily removing it from the system completely solves the problem.