Custom loss function breaking backpropagation in PyTorch - neural-network

I am using this class
class MonetaryLoss(nn.Module):
    def __init__(self, risk, payout):
        super(MonetaryLoss, self).__init__()
        self.risk = risk
        self.payout = payout

    def forward(self, pred, y):
        equity = 100
        pred = t.max(pred, 1).indices
        for i in range(len(pred)):
            if (pred[i] == y[i]):
                equity += equity * self.risk * self.payout
            else:
                equity -= equity * self.risk
            print(equity)
        return equity
to prepare some examples of custom losses. When I use t.max(pred,1).indices, the gradient is lost and this forward() function does not return anything with a gradient. I thought using
loss = Variable(t.Tensor[0], requires_grad = True)
and then updating the loss variable over training iterations would solve the problem. However, it breaks the backpropagation chain. What is the best way to create this loss and leave the backpropagation chain intact?

Your loss function has to be differentiable over its whole domain, which means no sharp turning points. In layman's terms, it has to be "smooth".

What you are taking is the argmax of the tensor, and this operator is not differentiable. In practical terms, this means you can't backpropagate through that operation, i.e. call backward on the result of pred.max(1).indices or, similarly, pred.argmax(1).
Here have a look:
>>> pred = torch.rand(10, 10, requires_grad=True)
>>> values, indices = pred.max(1)
>>> values.grad_fn # can be backpropagated on:
<MaxBackward0 at 0x7febc10d2ed0>
>>> indices.grad_fn # can't be backpropagated on:
None
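For the monetary loss above, one hedged workaround (not part of the original answer) is to avoid the hard argmax entirely and work with the softmax probabilities instead, so the result keeps a grad_fn. SoftMonetaryLoss below is a hypothetical sketch of that idea: it replaces the win/lose branch with an expected per-sample return, under the assumption that y contains class indices.
import torch
import torch.nn as nn

class SoftMonetaryLoss(nn.Module):
    def __init__(self, risk, payout):
        super().__init__()
        self.risk = risk
        self.payout = payout

    def forward(self, pred, y):
        # Probability assigned to the correct class, via softmax instead of argmax.
        p_correct = torch.softmax(pred, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
        # Expected per-sample return: win with probability p_correct, lose otherwise.
        expected_return = p_correct * self.risk * self.payout - (1 - p_correct) * self.risk
        # Negate so that minimizing the loss maximizes the expected return.
        return -expected_return.mean()
Because every operation here is differentiable, the returned scalar has a grad_fn and backward() propagates through it.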

Related

How to get the gradient of the specific output of the neural network to the network parameters

I am building a Bayesian neural network, and I need to manually calculate the gradient of each neural network output and update the network parameters.
For example, in the following network, how can I get the gradients of the neural network outputs ag and bg with respect to the network parameters phi, that is, ∂ag/∂phi and ∂bg/∂phi, and update the parameters respectively?
class encoder(torch.nn.Module):
    def __init__(self, _l_dim, _hidden_dim, _fg_dim):
        super(encoder, self).__init__()
        self.hidden_nn = nn.Linear(_l_dim, _hidden_dim)
        self.ag_nn = nn.Linear(_hidden_dim, _fg_dim)
        self.bg_nn = nn.Linear(_hidden_dim, _fg_dim)

    def forward(self, _lg):
        ag = self.ag_nn(self.hidden_nn(_lg))
        bg = self.bg_nn(self.hidden_nn(_lg))
        return ag, bg
If you want to compute dx/dW, you can use autograd for that: torch.autograd.grad(x, W, grad_outputs=torch.ones_like(x), retain_graph=True). Does that actually accomplish what you're trying to do?
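As a hedged illustration of that comment, a minimal sketch using the encoder class from the question (the dimensions and input below are made-up placeholders):
import torch

model = encoder(_l_dim=63, _hidden_dim=32, _fg_dim=6)  # placeholder dimensions
x = torch.randn(4, 63)
ag, bg = model(x)

# Gradient of ag w.r.t. the shared hidden layer's weight matrix;
# grad_outputs=ones_like(ag) sums the per-element gradients,
# retain_graph=True keeps the graph for a second grad call on bg.
dag_dW = torch.autograd.grad(ag, model.hidden_nn.weight,
                             grad_outputs=torch.ones_like(ag),
                             retain_graph=True)[0]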
Problem statement
You are looking to compute the gradients of the parameters corresponding to each loss term. Consider a model f parametrized by θ_ag and θ_bg. These two parameter sets might overlap: that's the case here, since you have a shared hidden layer. Then f(x; θ_ag, θ_bg) will output a pair of elements ag and bg, and your loss function is defined as L = L_ag + L_bg.
The terms you want to compute are dL_ag/dθ_ag and dL_bg/dθ_bg, which is different from what you would typically get with a single backward call, namely dL/dθ_ag and dL/dθ_bg.
Implementation
In order to compute those terms, you will require two backward passes; after each one we will collect the respective terms. Before starting, here are a couple of things you need to do:
It will be useful to make θ_ag and θ_bg available to us. You can, for example, add these two methods to your model definition:
def ag_params(self):
    return [*self.hidden_nn.parameters(), *self.ag_nn.parameters()]

def bg_params(self):
    return [*self.hidden_nn.parameters(), *self.bg_nn.parameters()]
Assume you have a loss function loss_fn which outputs two scalar values, L_ag and L_bg. Here is a mockup for loss_fn:
def loss_fn(ag, bg):
    return ag.mean(), bg.mean()
We will need an optimizer to zero the gradient out, here SGD:
optim = torch.optim.SGD(model.parameters(), lr=1e-3)
Then we can start applying the following method:
Do an inference to compute ag and bg, as well as L_ag and L_bg:
>>> ag, bg = model(x)
>>> L_ag, L_bg = loss_fn(ag, bg)
Backpropagate once on L_ag, while retaining the graph:
>>> L_ag.backward(retain_graph=True)
At this point, we can collect dL_ag/dθ_ag on the parameters contained in θ_ag. For example, you could pick the norm of the different parameter gradients using the ag_params function:
>>> pgrad_ag = torch.stack([p.grad.norm()
...                         for p in model.ag_params() if p.grad is not None])
Next we can proceed with a second backpropagation, this time on L_bg. But before that, we need to clear the gradients so dL_ag/dθ_ag doesn't pollute the next computation:
>>> optim.zero_grad()
Backpropagation on L_bg:
>>> L_bg.backward(retain_graph=True)
Here again, we collect the gradient norms, i.e. those of dL_bg/dθ_bg, this time using the bg_params function:
>>> pgrad_bg = torch.stack([p.grad.norm()
...                         for p in model.bg_params() if p.grad is not None])
Now you have pgrad_ag and pgrad_bg, which correspond to the gradient norms of dL_ag/dθ_ag and dL_bg/dθ_bg respectively.
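Putting the steps together, a minimal end-to-end sketch under the same assumptions (the encoder class with the ag_params/bg_params methods and the loss_fn mockup above; the input and dimensions are placeholders):
import torch

model = encoder(_l_dim=63, _hidden_dim=32, _fg_dim=6)  # placeholder dimensions
optim = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(4, 63)

ag, bg = model(x)
L_ag, L_bg = loss_fn(ag, bg)

# First pass: gradients of L_ag, keeping the graph for the second pass.
L_ag.backward(retain_graph=True)
pgrad_ag = torch.stack([p.grad.norm()
                        for p in model.ag_params() if p.grad is not None])

# Clear the gradients so dL_ag/dθ_ag doesn't leak into the second pass.
optim.zero_grad()

# Second pass: gradients of L_bg.
L_bg.backward(retain_graph=True)
pgrad_bg = torch.stack([p.grad.norm()
                        for p in model.bg_params() if p.grad is not None])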

Problem understanding loss function behavior using Flux.jl in Julia

So, first of all, I am new to neural networks (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates a data set made of
a collection of input vectors (each with 63 elements) and their corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x, y, z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
    Flux.train!(loss,ps,datatrain,opt);
    push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, it has come to my attention that as I train my model with an increasing data set (Nₜᵣ = 10, 15, 25, etc.), the loss function seems to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions
I have not tried changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NN and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to this)
Can this behavior be related to the chosen activation functions? (I am not inclined to this)
Can this behavior be related to the optimization algorithm? (I am not inclined to this)
Am I training my model wrong? Is the iteration loop really iterations, or are they epochs? I am struggling to put the concepts of "epochs" and "iterations" into practice (and to differentiate between them).
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for example the mean squared error (divide by the number of samples) instead of summing the squared cost over the whole data set.

MSELoss when mask is used

I'm trying to calculate MSELoss when a mask is used. Suppose that I have a tensor of shape [2, 33, 1] (batch size 2) as my target, and another input tensor with the same shape. Since the sequence length might differ for each instance, I also have a binary mask indicating the existence of each element in the input sequence. So here is what I'm doing:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
mse_loss_val = loss / loss.numel()
# now doing backpropagation
mse_loss_val.backward()
Is loss / loss.numel() a good practice? I'm skeptical, as I have to use reduction='none', and when calculating the final loss value, I think I should only consider those loss elements that are nonzero (i.e., unmasked); however, I'm taking the average over all tensor elements with torch.numel(). I'm actually trying to take the 1/n factor of MSELoss into account. Any thoughts?
There are some issues in the code. I think the correct code should be:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
non_zero_elements = mask.sum()
mse_loss_val = loss / non_zero_elements
# now doing backpropagation
mse_loss_val.backward()
If you are worried about numerical errors, summing first and then dividing by the number of unmasked elements is only slightly worse than using a built-in .mean() reduction.
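For completeness, a small usage sketch with dummy tensors (the shapes come from the question; the mask here is just a made-up 0/1 tensor of the same shape as the target):
import torch
import torch.nn as nn

input = torch.randn(2, 33, 1, requires_grad=True)   # stands in for the model output
target = torch.randn(2, 33, 1)
mask = (torch.rand(2, 33, 1) > 0.3).float()          # 1 = real element, 0 = padding

mse_loss = nn.MSELoss(reduction='none')
loss = (mse_loss(input, target) * mask).sum()
mse_loss_val = loss / mask.sum()                     # mean over unmasked elements only
mse_loss_val.backward()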

Multiple matrix multiplication loses weight updates

When, in the forward method, I only do one torch.add(torch.bmm(x, exp_w), self.b) step, my model backpropagates correctly. When I add another layer - torch.add(torch.bmm(out, exp_w2), self.b2) - the gradients are not updated and the model isn't learning. If I change the activation function from nn.Sigmoid to nn.ReLU, it works with two layers.
I've been thinking about this for a day now and can't figure out why it's not working with nn.Sigmoid.
I've tried different learning rates, loss functions and optimizers, but no combination seems to work. When I sum the weights before and after training, the sums are the same.
Code:
class MyModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        torch.manual_seed(1)
        super(MyModel, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        hidden_1_dimentsions = 20
        self.w = torch.nn.Parameter(torch.empty(input_dim, hidden_1_dimentsions).uniform_(0, 1))
        self.b = torch.nn.Parameter(torch.empty(hidden_1_dimentsions).uniform_(0, 1))
        self.w2 = torch.nn.Parameter(torch.empty(hidden_1_dimentsions, output_dim).uniform_(0, 1))
        self.b2 = torch.nn.Parameter(torch.empty(output_dim).uniform_(0, 1))

    def activation(self):
        return torch.nn.Sigmoid()

    def forward(self, x):
        x = x.view((x.shape[0], 1, self.input_dim))
        exp_w = self.w.expand(x.shape[0], self.w.size(0), self.w.size(1))
        out = torch.add(torch.bmm(x, exp_w), self.b)
        exp_w2 = self.w2.expand(out.shape[0], self.w2.size(0), self.w2.size(1))
        out = torch.add(torch.bmm(out, exp_w2), self.b2)
        out = self.activation()(out)
        return out.view(x.shape[0])
Besides loss functions, activation functions and learning rates, your parameter initialisation is also important. I suggest you take a look at Xavier initialisation: https://pytorch.org/docs/stable/nn.html#torch.nn.init.xavier_uniform_
Furthermore, for a wide range of problems and network architectures, Batch Normalization, which ensures that your activations have zero mean and unit standard deviation, helps: https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d
If you are interested to know more about the reason for this, it's mostly due to the vanishing gradient problem, which means that your gradients get so small that your weights don't get updated. It's so common that it has its own page on Wikipedia: https://en.wikipedia.org/wiki/Vanishing_gradient_problem
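For instance, here is a hedged sketch of what Xavier initialisation could look like for the parameters in the question's MyModel (the dimensions are placeholders, and setting the biases to zero is a common but not mandatory choice):
import torch
import torch.nn as nn

model = MyModel(input_dim=10, output_dim=1)  # placeholder dimensions

# xavier_uniform_ expects tensors with at least 2 dimensions,
# so it is only applied to the weight matrices.
with torch.no_grad():
    nn.init.xavier_uniform_(model.w)
    nn.init.xavier_uniform_(model.w2)
    model.b.zero_()
    model.b2.zero_()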

Fitting a sine wave with Keras and PYMC3 yields unexpected results

I've been trying to fit a sine curve with a keras (theano backend) model using pymc3. I've been using this [http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/] as a reference point.
A Keras implementation alone, fit using optimization, does a good job; however, Hamiltonian Monte Carlo and variational sampling from pymc3 are not fitting the data. The trace is stuck where the prior is initiated. When I move the prior, the posterior moves to the same spot. The posterior predictive of the Bayesian model in cell 59 is barely getting the sine wave, whereas the non-Bayesian fit gets it nearly perfect in cell 63. I created a notebook here: https://gist.github.com/tomc4yt/d2fb694247984b1f8e89cfd80aff8706 which shows the code and the results.
Here is a snippet of the model below...
class GaussWeights(object):
    def __init__(self):
        self.count = 0

    def __call__(self, shape, name='w'):
        return pm.Normal(
            name, mu=0, sd=.1,
            testval=np.random.normal(size=shape).astype(np.float32),
            shape=shape)

def build_ann(x, y, init):
    with pm.Model() as m:
        i = Input(tensor=x, shape=x.get_value().shape[1:])
        m = i
        m = Dense(4, init=init, activation='tanh')(m)
        m = Dense(1, init=init, activation='tanh')(m)
        sigma = pm.Normal('sigma', 0, 1, transform=None)
        out = pm.Normal('out',
                        m, 1,
                        observed=y, transform=None)
    return out

with pm.Model() as neural_network:
    likelihood = build_ann(input_var, target_var, GaussWeights())
    # v_params = pm.variational.advi(
    #     n=300, learning_rate=.4
    # )
    # trace = pm.variational.sample_vp(v_params, draws=2000)
    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.HamiltonianMC(scaling=start)
    trace = pm.sample(1000, step, progressbar=True)
The model contains normal noise with a fixed std of 1:
out = pm.Normal('out', m, 1, observed=y)
but the dataset does not. It is only natural that the posterior predictive does not match the dataset; they were generated in very different ways. To make it more realistic you could add noise to your dataset, and then estimate sigma:
mu = pm.Deterministic('mu', m)
sigma = pm.HalfCauchy('sigma', beta=1)
pm.Normal('y', mu=mu, sd=sigma, observed=y)
What you are doing right now is similar to taking the output from the network and adding standard normal noise.
A couple of unrelated comments:
out is not the likelihood, it is just the dataset again.
If you use HamiltonianMC instead of NUTS, you need to set the step size and the integration time yourself. The defaults are not usually useful.
It seems Keras changed in 2.0, and this way of combining pymc3 and Keras does not seem to work anymore.