PyTorch gradient calculation with two different models - neural-network

I am new to pytorch and this is what I want to do:
I have two models that are related to one another: modelA and modelB. I want to get three separate gradients. I am able to get the first two without any issues, but I am not sure how I can get the third one.
differentiating the loss of modelA wrt modelA.parameters
[v.grad.data for v in modelA.parameters()]
differentiating the loss of modelB wrt modelB.parameters
[v.grad.data for v in modelB.parameters()]
differentiating the loss of modelB wrt modelA.parameters
This is what I tried:
torch.autograd.grad(lossB, modelA.parameters())
However, I get the following error:
grad can be implicitly created only for scalar outputs
Any help would be great!
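The error message suggests that lossB is not a scalar, for example because it was computed with an unreduced loss. Here is a minimal sketch of two ways around that. It assumes modelB consumes the output of modelA (the question does not say exactly how the models are related), and the layer sizes, data, and squared-error loss are placeholders, not the asker's actual setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: modelB consumes modelA's output,
# so lossB depends on modelA's parameters.
modelA = nn.Linear(4, 3)
modelB = nn.Linear(3, 2)
x = torch.randn(8, 4)
target = torch.randn(8, 2)

# Unreduced loss: one value per element, shape (8, 2), not a scalar.
lossB_per_element = (modelB(modelA(x)) - target) ** 2

# autograd.grad expects a tensor or a sequence of tensors as inputs.
params_A = list(modelA.parameters())

# Fix 1: reduce the loss to a scalar before differentiating.
grads_from_sum = torch.autograd.grad(
    lossB_per_element.sum(), params_A, retain_graph=True
)

# Fix 2: keep the loss unreduced and pass grad_outputs explicitly.
# A tensor of ones gives the same result as summing first.
grads_from_ones = torch.autograd.grad(
    lossB_per_element, params_A,
    grad_outputs=torch.ones_like(lossB_per_element),
)
```

Note that torch.autograd.grad returns the gradients as a tuple and does not fill in the .grad attributes of modelA's parameters. If lossB does not actually depend on some of modelA's parameters, the call raises a different error unless allow_unused=True is passed.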

Related

Matlab's fitcsvm classification model "Beta weights" empty

Hello problem solvers,
I am currently analyzing some fMRI data from a visual task and am stuck on a problem. The idea was to train an SVM classifier on some data and then use the weight vector to project novel data onto it. However, the model struct created by fitcsvm (Matlab 2018a) only sometimes actually contains a weight vector. In many cases mdl.Beta is empty.
mdl = fitcsvm(training, labels)
weights = mdl.Beta
I looked into why that could be and made sure the input is double, is not just zeros, and does not contain NaNs. So far, I have not been able to identify a rule for why it sometimes returns empty and sometimes does not. If anything, it seems that as I increase the amount of input data, mdl.Beta is empty less frequently. But I can only change so much about that. :(
I am happy about any help!
Thanks!
Edit: Unfortunately, like so often with fMRI, training data is quite limited. The training data consists of 50 to 300 features (depending on brain area) and one example for class A and five examples for class B.

Gradient with respect to the parameters of a specific layer in Pytorch

I am building a model in PyTorch with multiple networks. For example, let's consider netA and netB. In the loss function I need to work with the composition netA(netB). In some parts of the optimization I need to calculate the gradient of loss_func(netA(netB)) with respect to only the parameters of netA, and in others I need to calculate the gradients wrt the parameters of netB. How should one approach this problem?
My approach: In the case of calculating the gradient wrt the parameters of netA I use loss_func(netA(netB.detach())).
If I write loss_func(netA(netB).detach()) it seems that the parameters of both netA and netB are detached.
I tried to use loss_func(netA.detach(netB)) in order to only detach the parameters of netA but it doesn't work. (I get the error that netA doesn't have attribute detach.)
Gradients are properties of tensors, not of networks.
Therefore, you can only call .detach on a tensor.
You can have different optimizers for each network. This way you can compute gradients for all networks all the time, but only update weights (calling step of the relevant optimizer) for the relevant network.
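Both ideas can be sketched as follows. This is an illustration under assumptions, not the asker's code: the layer sizes, the input x, and the squared-output loss are placeholders. It also shows why detaching the final output, as in loss_func(netA(netB).detach()), cuts off both networks, whereas detaching the intermediate tensor only cuts off netB.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

netB = nn.Linear(4, 3)   # hypothetical inner network
netA = nn.Linear(3, 1)   # hypothetical outer network
optA = torch.optim.SGD(netA.parameters(), lr=0.1)
optB = torch.optim.SGD(netB.parameters(), lr=0.1)
x = torch.randn(8, 4)

# Option 1: detach the tensor produced by netB, so backpropagation stops there.
hidden = netB(x)
loss = netA(hidden.detach()).pow(2).mean()
loss.backward()
netA_has_grad = all(p.grad is not None for p in netA.parameters())
netB_has_no_grad = all(p.grad is None for p in netB.parameters())

# Option 2: compute gradients for both networks, but step only one optimizer.
optA.zero_grad()
optB.zero_grad()
weightA_before = netA.weight.detach().clone()
weightB_before = netB.weight.detach().clone()
loss = netA(netB(x)).pow(2).mean()
loss.backward()
optB.step()   # updates netB only; netA's weights stay as they were
```

Option 1 saves the backward computation through netB. Option 2 is simpler when both networks are updated at different points of the same training loop, as long as stale gradients are cleared with zero_grad before each backward pass.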

Fitting multivariate mixed models with continuous and ordinal dependent variables

I would like to run a multivariate mixed regression MCMC model with two response (dependent) variables, namely Boldness scores (continuous variable) and Aggression ranks (ordinal ranks). Trial number (an integer) is the fixed effect, while individual ID is the random effect. I'm using a mixed model approach to partition between-individual covariance from within-individual covariance. I would much appreciate it if someone could tell me how to do this and which package to use, preferably in R, and what priors to specify. Thank you very much in advance!

One class learning to make predictions using MATLAB

I am using MATLAB to build a prediction model whose target is binary.
The problem is that the negative observations in my training data may in fact be positives that were simply not detected.
I started with a logistic regression model, assuming the data is accurate, and the results were less than satisfactory. After some research, I moved to one-class learning, hoping that I can focus only on the part of the data (the positives) that I am certain about.
I looked up the related materials from MATLAB documentation and found that I can use fitcsvm to proceed.
My current problem is:
Am I on the right path? Can one class learning solve my problem?
I tried to use fitcsvm to create a ClassificationSVM using all the positive observations that I have.
model = fitcsvm(Instance,Label,'KernelScale','auto','Standardize',true)
However, when I try to use the model to predict
[label,score] = predict(model,Test)
All the labels predicted for my Test cases are 1. I think I did something wrong. So should I feed the svm only the positive observations that I have?
If not what should I do?

Error function and ReLu in a CNN

I'm trying to get a better understanding of neural networks by programming a Convolutional Neural Network myself.
So far, I'm keeping it pretty simple by not using max-pooling and using simple ReLU activation. I'm aware of the disadvantages of this setup, but the point is not to make the best image detector in the world.
Now, I'm stuck on the details of the error calculation: how the error is propagated back and how it interacts with the activation function when calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum function doesn't have defined start and end points, so I basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (I assume he means E_total?) and gives an example of how to define the new weight:
where W is the weights of a particular layer.
This confuses me, as I was always under the impression that the activation function (ReLU in my case) played a role in how to calculate the new weight. Also, this seems to imply that I simply use the error for all layers. Doesn't the error value I propagate back into the next layer somehow depend on what I calculated in the previous one?
Maybe all of this is just incomplete and you can point me in the direction that helps me best for my case.
Thanks in advance.
You do not backpropagate errors, but gradients. The activation function plays a role in calculating the new weight, depending on whether the weight in question comes before or after the activation, and on whether the two are connected. If a weight w comes after your non-linearity layer f, then the gradient dL/dw won't depend on f. But if w comes before f and they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw=(dL/df)*df/dw //notations might change according to the shape
//of the tensors/matrices/vectors you chose, but
//this is just the chain rule
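To make the chain rule concrete, here is a small sketch for a single neuron with a ReLU placed directly after a weight vector w. The squared-error loss, the input values, and all function names are assumptions chosen for illustration. The analytic gradient is checked against a finite-difference estimate.

```python
def relu(z):
    return max(0.0, z)

def forward(w, x, y):
    """Return pre-activation z, activation a = relu(z), and loss L = (a - y)^2."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    a = relu(z)
    return z, a, (a - y) ** 2

def analytic_grad(w, x, y):
    """dL/dw = dL/da * da/dz * dz/dw, where da/dz is the ReLU derivative."""
    z, a, _ = forward(w, x, y)
    dL_da = 2.0 * (a - y)
    da_dz = 1.0 if z > 0 else 0.0   # ReLU passes the gradient only when active
    return [dL_da * da_dz * xi for xi in x]

def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        grads.append((forward(w_plus, x, y)[2] - forward(w_minus, x, y)[2]) / (2 * eps))
    return grads

w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, 3.0]
y = 1.0
grad_active = analytic_grad(w, x, y)                     # z = 2.3 > 0, ReLU is active
grad_inactive = analytic_grad([-1.0, -1.0, -1.0], x, y)  # z = -6 < 0, gradient is zero
```

When the pre-activation is negative, the ReLU derivative is zero and no gradient reaches w, which is exactly how the activation function enters the weight update.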
As for your cost function, it is correct. Many people write these formulas in this informal style so that you get the idea and can adapt it to your own tensor shapes. By the way, this sort of MSE function is better suited to continuous label spaces. You might want to use softmax or an SVM loss for image classification (I'll come back to that). Anyway, as you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2d pixel x_i and predicts a 3d vector v_i for that pixel. Now, in your training data, x_i will already have a ground truth 3d vector (i.e. a label), which we'll call y_i. Then, your cost function will be (the index i runs over all data samples):
sum_i{(y_i-v_i)^t (y_i-v_i)}=sum_i{||y_i-v_i||^2}
But as I said, this cost function works if the labels form a continuous space (here, R^3). This is also called a regression problem.
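As a quick illustration of that sum with explicit bounds, here is a sketch in plain Python. The function name and the two example vectors are made up for the example.

```python
def sum_squared_error(targets, predictions):
    """Sum over all samples i of ||y_i - v_i||^2."""
    total = 0.0
    for y, v in zip(targets, predictions):
        # Inner sum runs over the components of one 3d vector.
        total += sum((yk - vk) ** 2 for yk, vk in zip(y, v))
    return total

# Two samples: the first prediction is exact, the second is off in two components.
targets = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
predictions = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
cost = sum_squared_error(targets, predictions)
```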
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss; the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set, for each data point x_i, you have a label c_i that indicates the correct class. Now, your neural network should produce scores for each possible label, which we'll denote s_1,..,s_n. Let's denote the score of the correct class of a training sample x_i as s_{c_i}. Now, if we use a softmax function, the intuition is to transform the scores into a probability distribution and maximise the probability of the correct classes. That is, we maximise the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j))}
where i runs over all training samples, and j=1,..,n runs over all class labels.
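A minimal sketch of this objective in plain Python follows; the function names are made up for illustration. It also includes the variant that is commonly used in practice, which minimises the mean negative log of the same probabilities (cross-entropy) instead of maximising their sum.

```python
import math

def softmax(scores):
    """Turn raw scores s_1..s_n into a probability distribution."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def correct_class_probability_sum(all_scores, labels):
    """sum_i exp(s_{c_i}) / sum_j exp(s_j): the quantity to maximise."""
    return sum(softmax(s)[c] for s, c in zip(all_scores, labels))

def cross_entropy(all_scores, labels):
    """Mean negative log-probability of the correct class: the quantity to minimise."""
    return -sum(math.log(softmax(s)[c]) for s, c in zip(all_scores, labels)) / len(labels)

# Two samples with three classes each; labels are the correct class indices c_i.
all_scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
objective = correct_class_probability_sum(all_scores, labels)
loss = cross_entropy(all_scores, labels)
```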
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (especially the Andrej Karpathy parts at least).