MSELoss when mask is used - neural-network

I'm trying to calculate MSELoss when mask is used. Suppose that I have tensor with batch_size of 2: [2, 33, 1] as my target, and another input tensor with the same shape. Since sequence length might differ for each instance, I have also a binary mask indicating the existence of each element in the input sequence. So here is what I'm doing:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
mse_loss_val = loss / loss.numel()
# now doing backpropagation
mse_loss_val.backward()
Is loss / loss.numel() a good practice? I'm skeptical, as I have to use reduction='none' and when calculating final loss value, I think I should calculate the loss only considering those loss elements that are nonzero (i.e., unmasked), however, I'm taking the average over all tensor elements with torch.numel(). I'm actually trying to take 1/n factor of MSELoss into account. Any thoughts?

There are some issues in the code. I think correct code should be:
mse_loss = nn.MSELoss(reduction='none')
loss = mse_loss(input, target)
loss = (loss * mask.float()).sum() # gives \sigma_euclidean over unmasked elements
non_zero_elements = mask.sum()
mse_loss_val = loss / non_zero_elements
# now doing backpropagation
mse_loss_val.backward()
This is only slightly worse than using .mean() if you are worried about numerical errors.

Related

Problem understanding Loss function behavior using Flux.jl. in Julia

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

Bicoin price prediction using spark and scala [duplicate]

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
split = map(sanitize, line.split(','))
rev = split.pop(-2)
return LabeledPoint(rev, split)
def sanitize(value):
return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems to not work for your use case. I'd recommend you to try different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterartions = 10, stepSize = 0.001)

Calculating a interest rate tree in matlab

I would like to calibrate a interest rate tree using the optimization tool in matlab. Need some guidance on doing it.
The interest rate tree looks like this:
How it works:
3.73% = 2.5%*exp(2*0.2)
96.40453 = (0.5*100 + 0.5*100)/(1+3.73%)
94.15801 = (0.5*96.40453+ 0.5*97.56098)/(1+2.50%)
The value of 2.5% is arbitrary and the upper node is obtained by multiplying with an exponential of 2*volatility(here it is 20%).
I need to optimize the problem by varying different values for the lower node.
How do I do this optimization in Matlab?
What I have tried so far?
InterestTree{1}(1,1) = 0.03;
InterestTree{3-1}(1,3-1)= 2.5/100;
InterestTree{3}(2,:) = 100;
InterestTree{3-1}(1,3-2)= (2.5*exp(2*0.2))/100;
InterestTree{3-1}(2,3-1)=(0.5*InterestTree{3}(2,3)+0.5*InterestTree{3}(2,3-1))/(1+InterestTree{3-1}(1,3-1));
j = 3-2;
InterestTree{3-1}(2,3-2)=(0.5*InterestTree{3}(2,j+1)+0.5*InterestTree{3}(2,j))/(1+InterestTree{3-1}(1,j));
InterestTree{3-2}(2,3-2)=(0.5*InterestTree{3-1}(2,j+1)+0.5*InterestTree{3-1}(2,j))/(1+InterestTree{3-2}(1,j));
But I am not sure how to go about the optimization. Any suggestions to improve the code, do tell me..Need some guidance on this..
Are you expecting the tree to increase in size? Or are you just optimizing over the value of the "2.5%" parameter?
If it's the latter, there are two ways. The first is to model the tree using a closed form expression by replacing 2.5% with x, which is possible with the tree. There are nonlinear optimization toolboxes available in Matlab (e.g. more here), but it's been too long since I've done this to give you a more detailed answer.
The seconds is the approach I would immediately do. I'm interpreting the example you gave, so the equations I'm using may be incorrect - however, the principle of using the for loop is the same.
vol = 0.2;
maxival = 100;
val1 = zeros(1,maxival); %Preallocate
finalval = zeros(1,maxival);
for ival=1:maxival
val1(ival) = i/1000; %Use any scaling you want. This will go from 0.1% to 10%
val2=val1(ival)*exp(2*vol);
x1 = (0.5*100+0.5*100)/(1+val2); %Based on the equation you gave
x2 = (0.5*100+0.5*100)/(1+val1(ival)); %I'm assuming this is how you calculate the bottom node
finalval(ival) = x1*0.5+x2*0.5/(1+...); %The example you gave isn't clear, so replace this with whatever it should be
end
[maxval, indmaxval] = max(finalval);
The maximum value is in maxval, and the interest that maximized this is in val1(indmaxval).

Scale Factor in Matlabs `conv()`

I have the following code which is used to deconvolve a signal. It works very well, within my error limit...as long as I divide my final result by a very large factor (11000).
width = 83.66;
x = linspace(-400,400,1000);
a2 = 1.205e+004 ;
al = 1.778e+005 ;
b1 = 94.88 ;
c1 = 224.3 ;
d = 4.077 ;
measured = al*exp(-((abs((x-b1)./c1).^d)))+a2;
rect = #(x) 0.5*(sign(x+0.5) - sign(x-0.5));
rt = rect(x/83.66);
signal = conv(rt,measured,'same');
check = (1/11000)*conv(signal,rt,'same');
Here is what I have. measured represents the signal I was given. Signal is what I am trying to find. And check is to verify that if I convolve my slit with the signal I found, I get the same result. If you use what I have exactly, you will see that the check and measured are off by that factor of 11000~ish that I threw up there.
Does anyone have any suggestions. My thoughts are that the slit height is not exactly 1 or that convolve will not actually effectively deconvolve, as I request it to. (The use of deconv only gives me 1 point, so I used convolve instead).
I think you misunderstand what conv (and probably also therefore deconv) is doing.
A discrete convolution is simply a sum. In fact, you can expand it as a sum, using a couple of explicit loops, sums of products of the measured and rt vectors.
Note that sum(rt) is not 1. Were rt scaled to sum to 1, then conv would preserve the scaling of your original vector. So, note how the scalings pass through here.
sum(rt)
ans =
104
sum(measured)
ans =
1.0231e+08
signal = conv(rt,measured);
sum(signal)
ans =
1.0640e+10
sum(signal)/sum(rt)
ans =
1.0231e+08
See that this next version does preserve the scaling of your vector:
signal = conv(rt/sum(rt),measured);
sum(signal)
ans =
1.0231e+08
Now, as it turns out, you are using the same option for conv. This introduces an edge effect, since it truncates some of the signal so it ends up losing just a bit.
signal = conv(rt/sum(rt),measured,'same');
sum(signal)
ans =
1.0187e+08
The idea is that conv will preserve the scaling of your signal as long as the kernel is scaled to sum to 1, AND there are no losses due to truncation of the edges. Of course convolution as an integral also has a similar property.
By the way, where did that quoted factor of roughly 11000 come from?
sum(rt)^2
ans =
10816
Might be coincidence. Or not. Think about it.

why is the vector coming out of 'trapz' function as NAN?

i am trying to calculate the inverse fourier transform of the vector XRECW. for some reason i get a vector of NANs.
please help!!
t = -2:1/100:2;
x = ((2/5)*sin(5*pi*t))./((1/25)-t.^2);
w = -20*pi:0.01*pi:20*pi;
Hw = (exp(j*pi.*(w./(10*pi)))./(sinc(w./(10*pi)))).*(heaviside(w+5*pi)-heaviside(w-5*pi));%low pass filter
xzohw = 0;
for q=1:20:400
xzohw = xzohw + x(q).*(2./w).*sin(0.1.*w).*exp(-j.*w*0.2*((q-1)/20)+0.5);%calculating fourier transform of xzoh
end
xzohw = abs(xzohw);
xrecw = abs(xzohw.*Hw);%filtering the fourier transform high frequencies
xrect=0;
for q=1:401
xrect(q) = (1/(2*pi)).*trapz(xrecw.*exp(j*w*t(q))); %inverse fourier transform
end
xrect = abs(xrect);
plot(t,xrect)
Here's a direct answer to your question of "why" there is a nan. If you run your code, the Nan comes from dividing by zero in line 7 for computing xzohw. Notice that w contains zero:
>> find(w==0)
ans =
2001
and you can see in line 7 that you divide by the elements of w with the (2./w) factor.
A quick fix (although it is not a guarantee that your code will do what you want) is to avoid including 0 in w by using a step which avoids zero. Since pi is certainly not divisible by 100, you can try taking steps in .01 increments:
w = -20*pi:0.01:20*pi;
Using this, your code produces a plot which might resemble what you're looking for. In order to do better, we might need more details on exactly what you're trying to do, or what these variables represent.
Hope this helps!