MLP Neural network not training correctly, probably converging to a local minimum


I'm building an MLP neural network with back-propagation in MATLAB. The problem is that it doesn't seem to handle the curves in a function well, and it doesn't scale well with the target values. For example, it can reach about 80% of cos(x), but if I give it 100*cos(x) it doesn't train at all.
What is even weirder is that it trains well on some functions, while on others it doesn't work at all.
For example:
Well trained: http://img515.imageshack.us/img515/2148/coscox3.jpg
Not so well: http://img252.imageshack.us/img252/5370/cos2d.jpg (smoothness from being left a long time)
Wrong results, stuck like this: http://img717.imageshack.us/img717/2145/ex2ug.jpg
This is the algo I'm trying to implement:
http://img594.imageshack.us/img594/9590/13012012001.jpg
http://img27.imageshack.us/img27/954/13012012002.jpg
And this is my implementation:
close all;clc;
j=[4,3,1]; %number neurons in hidden layers and output layer
i=[1,j(1),j(2)];
X=0:0.1:pi;
d=cos(X);
%-----------Weights------------%
%-----First layer weights------%
W1p=rand([i(1)+1,j(1)]);
W1p=W1p/sum(W1p(:));
W1=rand([i(1)+1,j(1)]);
W1=W1/sum(W1(:));
%-----Second layer weights------%
W2p=rand([i(2)+1,j(2)]);
W2p=W2p/sum(W2p(:));
W2=rand([i(2)+1,j(2)]);
W2=W2/sum(W2(:));
%-----Third layer weights------%
W3p=rand([i(3)+1,j(3)]);
W3p=W3p/sum(W3p(:));
W3=rand([i(3)+1,j(3)]);
W3=W3/sum(W3(:));
%-----------/Weights-----------%
V1=zeros(1,j(1));
V2=zeros(1,j(2));
V3=zeros(1,j(3));
Y1a=zeros(1,j(1));
Y1=[0 Y1a];
Y2a=zeros(1,j(2));
Y2=[0 Y2a];
O=zeros(1,j(3));
e=zeros(1,j(3));
%----Learning and forgetting factor-----%
alpha=0.1;
etha=0.1;
sortie=zeros(1,length(X));
while(1)
n=randi(length(X),1);
%---------------Feed forward---------------%
%-----First layer-----%
X0=[-1 X(:,n)];
V1=X0*W1;
Y1a=tanh(V1/2);
%----Second layer-----%
Y1=[-1 Y1a];
V2=Y1*W2;
Y2a=tanh(V2/2);
%----Output layer-----%
Y2=[-1 Y2a];
V3=Y2*W3;
O=tanh(V3/2);
e=d(n)-O;
sortie(n)=O;
%------------/Feed Forward-----------------%
%------------Backward propagation---------%
%----Output layer-----%
delta3=e*0.5*(1+O)*(1-O);
W3n=W3+ alpha*(W3-W3p) + etha * delta3 * W3;
%----Second Layer-----%
delta2=zeros(1,length(Y2a));
for b=1:length(Y2a)
delta2(b)=0.5*(1-Y2a(b))*(1+Y2a(b)) * sum(delta3*W3(b+1,1));
end
W2n=W2 + alpha*(W2-W2p)+ (etha * delta2'*Y1)';
%----First Layer-----%
delta1=zeros(1,length(Y1a));
for b=1:length(Y1a)
for m=1:length(Y2a)
delta1(b)=0.5*(1-Y1a(b))*(1+Y1a(b)) * sum(delta2(m)*W2(b+1,m));
end
end
W1n=W1+ alpha*(W1-W1p)+ (etha * delta1'*X0)';
W3p=W3;
W3=W3n;
W2p=W2;
W2=W2n;
W1p=W1;
W1=W1n;
figure(1);
plot(1:length(d),d,1:length(d),sortie);
drawnow;
end
My question is, what can I do to correct it?
My guesses so far are that either something is wrong in the back-propagation, specifically in how I calculate the deltas and the weight updates, or the weights are initialized wrongly (too small, or not dependent on the initial input).

I am not an expert in this field, but I have some experience playing with MATLAB and Java based neural network systems.
I can suggest that using the toolbox could help you; it has helped others that I know.
I can offer a few points of information:
Do not expect NNs to work on all training data; sometimes the data is too complicated for classification in this manner.
The format of your NN will have a drastic impact on the convergence performance.
Finally:
Training algorithms like this will often train better when the various parameters are normalized to +/- 1. cos(x) is normalized, 100*cos(x) is not. This is because the weight updates required are much larger, and the training system might be taking very small steps. If you have data with multiple different ranges, then normalization is vital. Might I suggest you start with, at the very least, investigating that.
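To make the normalization point concrete, here is a minimal Python sketch (Python rather than MATLAB only so that it can be run anywhere). It maps the 100*cos(x) targets into [-1, 1] before training and maps network outputs back afterwards. The helper name fit_scaler is made up for this illustration. Note also that the question's output layer is O=tanh(V3/2), which can only produce values between -1 and 1, so unscaled targets of size 100 are out of reach regardless of the learning rate.

```python
import math

def fit_scaler(values):
    """Return (scale, unscale) functions mapping values to [-1, 1] and back.

    Hypothetical helper for illustration; not part of any toolbox.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant signal")

    def scale(v):
        return 2.0 * (v - lo) / (hi - lo) - 1.0

    def unscale(s):
        return (s + 1.0) * (hi - lo) / 2.0 + lo

    return scale, unscale

# Same grid as X=0:0.1:pi in the question (0, 0.1, ..., 3.1).
xs = [0.1 * k for k in range(32)]
targets = [100.0 * math.cos(x) for x in xs]

scale, unscale = fit_scaler(targets)
scaled_targets = [scale(t) for t in targets]       # train the network on these
recovered = [unscale(s) for s in scaled_targets]   # apply to network outputs
```

The network would then be trained against scaled_targets, and unscale would be applied to its outputs before plotting them against the original data.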

Related

How do I compare two weighted regressions in MatLab?

I've been using MatLab as a statistics tool. I like how much I can customise and code myself.
I was delighted to find that it's fairly straightforward to do a weighted linear regression in MatLab. As a slightly silly example, I can load the "carbig" data file and compare horsepower vs mileage for US cars to that of cars from other countries, but decide I only trust 8-cylinder cars.
load carbig
w=(Cylinders==8)+0.5*(Cylinders~=8)%1 if 8 cylinders, 0.5 otherwise.
for i=1:length(org)
o(i,1)=strcmp(org(i,:),org(1,:));%strcmp only works on one string.
end
x1=Horsepower(o==1)
x2=Horsepower(o==0)
y1=MPG(o==1)
y2=MPG(o==0)
w1=w(o==1)
w2=w(o==0)
lm1=fitlm(x1,y1,'Weights',w1)
lm2=fitlm(x2,y2,'Weights',w2)
This way, data from 8-cylinder cars will count as one data point, and data from 3-, 4-, 5- and 6-cylinder cars will count as half a data point.
Problem is, the obvious way to compare the two regressions is to use ANCOVA, which MatLab has a function for:
aoctool(Horsepower,MPG,o)
This function compares linear regressions on the two groups, but I haven't found an obvious way to include weights.
I suspect I can have a closer look at what the ANCOVA does and include the weights manually. Any easier solution?

I figured that if I give the "trusted" measurements weight 2 and the "untrusted" measurements weight 1, then for regression purposes that's the same thing as having one extra identical measurement for each trusted one. Setting the weights to 1 and 0.5 should do the same thing. I can do this with a script.
Duplicating rows also increases the degrees of freedom quite a bit, so I manually set the degrees of freedom to sum(w)-rank instead of n-rank.
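x=[];
y=[];
g=[];
w=(Cylinders==8)+0.5*(Cylinders~=8);
df=sum(w)  % total weight; stands in for n in df = n - rank
% Append each observation once per 0.5 units of weight:
% twice for 8-cylinder cars, once for all others.
for i=1:length(w)
while w(i)>0
x=[x;Horsepower(i)];
y=[y;MPG(i)];
g=[g;o(i)];
w(i)=w(i)-0.5;
end
end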
I then copied the aoctool.m file (edit aoctool) and inserted the value of df somewhere in the new file. It isn't elegant, but it seems to work.
edit aoctool.m
%(insert new df somewhere. Save as aoctool2.m)
aoctool2(x,y,g)
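As a sanity check on the duplication idea, here is a small Python sketch (not MATLAB, and with made-up numbers rather than the carbig data). It fits a weighted straight line in closed form and confirms that weights of 2 and 1 give the same intercept and slope as duplicating the trusted rows, and that weights of 1 and 0.5 give the same coefficients again. The function weighted_line_fit is a hypothetical helper, not a MATLAB or toolbox function, and it only reproduces the coefficients, not the standard errors or degrees of freedom.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

# Illustrative data only (not taken from carbig).
x = [100.0, 150.0, 200.0, 120.0, 90.0, 60.0]
y = [25.0, 18.0, 13.0, 22.0, 28.0, 35.0]
trusted = [False, True, True, False, False, False]

# Option 1: weights 2 (trusted) and 1 (untrusted).
fit_w21 = weighted_line_fit(x, y, [2.0 if t else 1.0 for t in trusted])

# Option 2: weights 1 and 0.5; only the ratio of the weights matters.
fit_w105 = weighted_line_fit(x, y, [1.0 if t else 0.5 for t in trusted])

# Option 3: duplicate the trusted rows and fit with equal weights.
x_dup = x + [xi for xi, t in zip(x, trusted) if t]
y_dup = y + [yi for yi, t in zip(y, trusted) if t]
fit_dup = weighted_line_fit(x_dup, y_dup, [1.0] * len(x_dup))
```

All three fits agree on the coefficients; what differs between them is the apparent sample size, which is why the degrees of freedom have to be corrected by hand.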

Feed Forward - Neural Networks Keras

For the input to the feed-forward neural network that I have implemented in Keras, I just wanted to check that my understanding is correct.
[[ 25.26000023 26.37000084 24.67000008 23.30999947]
[ 26.37000084 24.67000008 23.30999947 21.36000061]
[ 24.67000008 23.30999947 21.36000061 19.77000046]...]
So each row of the data above is a time window of 4 inputs. My input layer is
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.fit(trainX, trainY, nb_epoch=10000,verbose=2,batch_size=4)
and batch_size is 4. In theory, when I call the fit function, will it go over all these inputs in each of the nb_epoch epochs? And does the batch_size need to be 4 in order for this time window to work?
Thanks John

and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch?
Yes, each epoch is one iteration over all training samples.
and does the batch_size need to be 4 in order for this time window to work?
No, these are completely unrelated things. A batch is simply a subset of your training data which is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to the original Gradient Descent), but each update gets more expensive. The closer you get to 1, the more stochastic and noisy the approximation becomes (and the closer you are to Stochastic Gradient Descent). The fact that your batch_size matches your data dimensionality is just an odd coincidence and has no meaning.
Let me put this in a more general setting. What you do in gradient descent with an additive loss function (which neural nets usually use) is go against the gradient, which is
grad_theta 1/N SUM_i=1^N loss(x_i, pred(x_i), y_i|theta) =
= 1/N SUM_i=1^N grad_theta loss(x_i, pred(x_i), y_i|theta)
where loss is some loss function over your pred (prediction) as compared to y_i.
In the batch-based scenario, the rough idea is that you do not need to go over all examples, but only over some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89) ... }, and use an approximation of the gradient of the form
1/|batch| SUM_(x_i, y_i) in batch: grad_theta loss(x_i, pred(x_i), y_i|theta)
As you can see, this is not related in any sense to the space where x_i lives, so there is no connection with the dimensionality of your data.
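The formulas above can be tried out on a toy problem. The following is only a sketch under simple assumptions: a one-parameter model pred(x) = theta * x, the squared loss 0.5 * (pred - y)^2, synthetic data, and helper names (grad_loss, mean_gradient) chosen for this example. It compares the full-data gradient with the gradient estimated from a random batch of 32 samples.

```python
import random

def grad_loss(theta, x, y):
    """Gradient w.r.t. theta of the squared loss 0.5 * (theta * x - y) ** 2."""
    return (theta * x - y) * x

def mean_gradient(theta, samples):
    """Average per-sample gradient over any collection of (x, y) pairs."""
    return sum(grad_loss(theta, x, y) for x, y in samples) / len(samples)

random.seed(0)
# Synthetic data generated from y = 3 * x.
data = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(1000))]

theta = 0.0
full_gradient = mean_gradient(theta, data)     # 1/N * sum over all samples
batch = random.sample(data, 32)                # a strict subset of the data
batch_gradient = mean_gradient(theta, batch)   # 1/|batch| * sum over the batch
```

The batch size only controls how many terms are averaged. Nothing in mean_gradient depends on how many features each x has.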

Let me explain this with an example:
When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time and will, theoretically at least, call the forward pass (and the backward pass) 32 / 4 = 8 times per epoch.
In the extreme case when your batch_size is 1, that is plain old stochastic gradient descent. When your batch_size is greater than 1 but smaller than the training set, it's called mini-batch gradient descent; when the batch is the whole training set, it's (full) batch gradient descent.
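The counting in this example can be written out as a short sketch. It is plain Python rather than Keras, and iter_batches is a made-up helper that only mimics how a training loop slices the data:

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(32))   # stand-ins for 32 training examples
batch_size = 4

# One epoch = one pass over every batch.
epoch = list(iter_batches(samples, batch_size))
updates_per_epoch = len(epoch)                       # 32 / 4 = 8
seen = [s for batch in epoch for s in batch]         # every sample, once
```

Each epoch therefore performs 8 weight updates while still visiting every one of the 32 examples exactly once.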

How to implement weight decay in tensorflow as in Caffe

In Caffe we have a decay_ratio which is usually set as 0.0005. Then all trainable parameters, e.g., W matrix in FC6 will be decayed by:
W = W * (1 - 0.0005)
after we applied the gradient to it.
I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).
In my experience, I often run into numerical problems after 100k iterations during training.
I have also gone through related questions on Stack Overflow, e.g.,
How to set weight cost strength in TensorFlow?
However, the solution there seems a little different from what is implemented in Caffe.
Does anyone have similar concerns? Thank you.

The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w, and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" at page 10)
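To see the difference numerically, here is a toy sketch in plain Python (no TensorFlow). It assumes a scalar weight, the loss 0.5 * (w - 1)^2, and one common form of momentum (v = momentum * v + gradient); the function names are made up for this illustration. With lambda = weight_decay / learning_rate, the two rules coincide for plain SGD but already drift apart after two momentum steps.

```python
def grad(w):
    """Gradient of the toy loss 0.5 * (w - 1) ** 2."""
    return w - 1.0

def run(steps, lr, wd, momentum, decoupled):
    """Run SGD (optionally with momentum) using either form of decay.

    decoupled=True : weight decay applied directly to the weight.
    decoupled=False: L2 term lam * w added to the gradient, lam = wd / lr.
    """
    w, v = 2.0, 0.0
    lam = wd / lr
    for _ in range(steps):
        g = grad(w)
        if decoupled:
            v = momentum * v + g
            w = w - lr * v - wd * w
        else:
            v = momentum * v + (g + lam * w)
            w = w - lr * v
    return w

LR, WD, STEPS = 0.1, 0.01, 2

# Plain SGD: both formulations give the same weight.
sgd_gap = abs(run(STEPS, LR, WD, 0.0, True) - run(STEPS, LR, WD, 0.0, False))

# SGD with momentum: the decay term is (or is not) fed through the
# momentum buffer, so the resulting weights differ.
momentum_gap = abs(run(STEPS, LR, WD, 0.9, True) - run(STEPS, LR, WD, 0.9, False))
```

Here sgd_gap is zero up to floating-point rounding, while momentum_gap is clearly non-zero.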
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This makes some use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key. I then sum all of these up and optimize the sum using SGD which, as shown above, corresponds to actual weight decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.

This is a duplicate question:
How to define weight decay for individual layers in TensorFlow?
# Create your variables
weights = tf.get_variable('weights', collections=['variables'])
with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor = WEIGHT_DECAY_FACTOR*tf.pack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )
# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)
# Add the other loss components to the collection losses
# ...
# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
You can just set whatever lambda value you want to the weight decay. The above just adds the l2 norm to it.

Local volatility model Matlab

I am trying to do a Monte Carlo simulation of a local volatility model, i.e.
dSt = sigma(St,t) * St dWt .
Unfortunately the MATLAB sde class cannot be applied, as the function is rather complex.
For this reason I am simulating this SDE manually with the Euler-Maruyama method. More specifically, I used Ito's formula to get an SDE for the log-process Xt=log(St):
dXt = -1/2 sigma^2(exp(Xt),t) dt + sigma(exp(Xt),t) dWt
The code for this is the following:
function [S]=geom_bb(sigma,T,N,m)
% T.. Time horizon, sigma.. standard deviation, N.. timesteps, m.. dimensions
X=zeros(N+1,m);
dt=T/N;
t=(0:N)'*dt;
dW=randn(N,m);
for j=1:N
X(j+1,:)=X(j,:) - 1/2* sigma(exp(X(j,:)),t(j))^2 * sqrt(dt) + sigma(exp(X(j,:)),t(j))*dW(j,:);
end
S=exp(X*sqrt(dt));
end
This code works rather well for small sigma; however, for sigma around 10 the process S always tends to zero. This should not happen, as S is a martingale and therefore has expectation 1 (at least for constant sigma).
However, X should be simulated correctly, as its mean is exact.
Can anyone help me with this issue? Is this only due to numerical rounding errors? Is there another simulation method that should be preferred to solve this problem?

First, are you sure that S=exp(X*sqrt(dt)) outside the loop is doing what you want? Why not have it inside the loop to start with? You're using exp(X) as the argument to sigma() inside the loop in any case, and that value is missing the sqrt(dt).
Beyond that, suggested ways to improve the behavior: use the Milstein scheme instead, increase the number of timesteps, and make sure your sigma() value is commensurate with your timestep. A sigma of 10 means 1000% volatility, i.e. moves of about 60% per day. Assuming dt is more than a few minutes, this simply can't be good.
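For comparison, here is a minimal Python sketch (not MATLAB) of the Euler-Maruyama step for the log-process exactly as the SDE in the question is written: the drift is multiplied by dt, the Brownian increment is scaled by sqrt(dt), and S = exp(X) with no extra rescaling. The function name simulate_log_euler, the constant sigma of 0.2, and the choice to keep only terminal values are assumptions made for this illustration.

```python
import math
import random

def simulate_log_euler(sigma, T, N, m, rng):
    """Simulate m terminal values S_T of dS = sigma(S, t) * S dW, S_0 = 1.

    Uses Euler-Maruyama on X = log(S):
        X_{j+1} = X_j - 0.5 * sigma^2 * dt + sigma * sqrt(dt) * Z_j
    """
    dt = T / N
    sqrt_dt = math.sqrt(dt)
    terminal = []
    for _ in range(m):
        x = 0.0
        for j in range(N):
            vol = sigma(math.exp(x), j * dt)
            x += -0.5 * vol ** 2 * dt + vol * sqrt_dt * rng.gauss(0.0, 1.0)
        terminal.append(math.exp(x))
    return terminal

rng = random.Random(42)
# Constant volatility of 0.2 as a simple test case.
paths = simulate_log_euler(lambda s, t: 0.2, T=1.0, N=50, m=5000, rng=rng)
mean_S = sum(paths) / len(paths)   # should be close to 1 (martingale)
```

With constant sigma the sample mean of S_T stays close to 1, which is a cheap check to run before moving on to a state-dependent sigma or a higher-order scheme.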

function parameters in matlab wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows. The most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, which should be seen as the 'gold standard'; the program I'm trying to adapt is MATLAB.
I was trying to be smart and wrote some code (a lot, at least for me). The goal of that code was the following:
1. Taking an individual longitudinal data file
2. Curve fitting this data to a non-parametric model using lsqcurvefit
3. Obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters that both programs generate, there was a huge difference. KaleidaGraph stays close to its original starting values. MATLAB wanders off, and the parameters sometimes get larger by a factor of 1000. The graphs, however, stay more or less the same in both situations, and both fit the data well. Still, it would be lovely to know how to make the MATLAB curve fitting more 'conservative' and keep it nearer to its original starting values.
validFitPersons = true(nbValidPersons,1);
for i=1:nbValidPersons
personalData = data{validPersons(i),3};
personalData = personalData(personalData(:,1)>=minAge,:);
% Fit a specific model for all valid persons
try
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
[personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
catch
x=1;
end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
% y = a.*(1-1/(1+(b_1(t+e))^c_1+(b_2(t+e))^c_2+(b_3(t+e))^c_3))
heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).* (ages+params(8) )).^params(5) +(params(3).* (ages+params(8) )).^params(6) +(params(4) .*(ages+params(8) )).^params(7) )));
modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
% Define initial values
if strcmpi('male',gender)
initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
else
initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
end
I've tried to mimic the curve fitting process in KaleidaGraph as closely as possible. I found that it uses the Levenberg-Marquardt algorithm, which I've selected here as well. However, the results still vary and I don't have any more clues about how I can change this.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I have is 5 models with different parameters and different starting values (the second part of my code), and then the general curve fitting file. Since there are different models, it would be interesting if I could put restrictions on how far the parameters can wander off from their starting values.
Does anyone have any idea how this could be done?
Anybody willing to help a psychology student?
Cheers

This is a common issue when dealing with non-linear models.
If I were you, I would check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use upper bounds and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
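As a sketch of how such bounds could be derived from the starting values, the following Python snippet builds lower and upper limits at plus or minus 20% around each initial value. The 20% figure and the helper name bounds_around are arbitrary assumptions for illustration; the initial values are the male jpa2 values from the question. In MATLAB, the resulting vectors would go in the lb and ub positions, which are the two empty [] arguments in the question's lsqcurvefit call.

```python
def bounds_around(initial_values, rel_tol=0.2):
    """Return (lower, upper) bounds within rel_tol of each initial value.

    abs() keeps lower <= upper even if an initial value is negative.
    """
    lower = [v - rel_tol * abs(v) for v in initial_values]
    upper = [v + rel_tol * abs(v) for v in initial_values]
    return lower, upper

# Male starting values for the 'jpa2' model, copied from the question.
initial_values = [176.76, 0.339, 0.1199, 0.0764, 0.42287, 2.818, 18.52, 0.4363]
lb, ub = bounds_around(initial_values, rel_tol=0.2)
```

How tight to make the bounds is a modelling decision: very tight bounds keep the parameters near the starting values but can also force a worse fit.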
Cheers

You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem is really with the model you have shown resulting in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
edit
To drive the point home regarding the problem with the choice of model, here are (1) results of a trial fit using simulated data (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.