Function approximation by ANN

So I have something like this,
y = l3*[sin(theta1)*cos(theta2)*cos(theta3) + cos(theta1)*sin(theta2)*cos(theta3) - sin(theta1)*sin(theta2)*sin(theta3) + cos(theta1)*cos(theta2)*sin(theta3)] + l2*[sin(theta1)*cos(theta2) + cos(theta1)*sin(theta2)] + l1*sin(theta1) + l0;
and something similar for x, where theta_i are angles from specified intervals and l_i are coefficients. The task is to approximate the inverse of this mapping: given x and y, return the appropriate thetas. So I randomly generate thetas from the specified intervals and compute x and y. Then I normalize x and y to <-1, 1> and the thetas to <0, 1>. I use this data as the training set: the inputs of the network are the normalized x and y, and the outputs are the normalized thetas.
I trained the network and tried different configurations, but the absolute error was still around 24.9% after a whole night of training. That is far too much, and I don't know what to try next:
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
The training algorithm is error backpropagation. The neurons have a sigmoid activation function and the units are biased. I tried the topologies [2 50 3] and [2 100 50 3]; the training set has 1000 samples and training ran for 1000 cycles (in one cycle I go through the whole dataset). The learning rate is 0.2.
The approximation error was computed as
sum(abs(desired_output - reached_output)) / dataset_length.
The optimizer is stochastic gradient descent.
The loss function is
1/2 * (desired - reached)^2
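In MATLAB-like notation (desired_output, reached_output and dataset_length are placeholder names, with the outputs taken over the whole dataset):
abs_error = sum(abs(desired_output - reached_output)) / dataset_length;   % reported error measure
loss      = 0.5 * (desired - reached).^2;                                 % per-sample loss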
The network was implemented in my own MATLAB template for neural networks. I know that is a weak point, but I'm confident the template is correct (it has successfully solved the XOR problem, approximated differential equations, and approximated a state regulator). I show the template anyway, because this information may be useful.
Neuron class and Network class (code not reproduced here)
EDIT:
I used 2500 unique samples within the theta ranges
theta1 in <0, 180>, theta2 in <-130, 130>, theta3 in <-150, 150>.
I also experimented with a larger dataset, but the accuracy did not improve.
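For reference, a minimal MATLAB sketch of how such a training set could be generated and normalized (the link lengths l0..l3 and the expression for x are placeholders; only y is shown):
N  = 2500;
l0 = 0; l1 = 1; l2 = 1; l3 = 1;                          % placeholder link lengths
t1 = deg2rad(   0 + 180*rand(N,1));                      % theta1 in <0, 180> degrees
t2 = deg2rad(-130 + 260*rand(N,1));                      % theta2 in <-130, 130>
t3 = deg2rad(-150 + 300*rand(N,1));                      % theta3 in <-150, 150>
y  = l3*(sin(t1).*cos(t2).*cos(t3) + cos(t1).*sin(t2).*cos(t3) ...
       - sin(t1).*sin(t2).*sin(t3) + cos(t1).*cos(t2).*sin(t3)) ...
   + l2*(sin(t1).*cos(t2) + cos(t1).*sin(t2)) + l1*sin(t1) + l0;
% x = ... (similar expression, omitted here)
yn  = 2*(y - min(y))./(max(y) - min(y)) - 1;             % network inputs scaled to <-1, 1>
t1n = (t1 - min(t1))./(max(t1) - min(t1));               % network targets scaled to <0, 1>
% ... the same scaling is applied to x, theta2 and theta3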

Related

Neural network y=f(x) regression

Encouraged by some success in MNIST classification, I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-values (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
Slicing my x-values into different batches (e.g. 10 elements per batch)
In each batch calculating the outputs of the net, then applying backpropagation, calculating weight and bias updates
After each batch, averaging the calculated weight and bias updates and actually updating the weights and biases
Repeating steps 1-3 multiple times
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I do something fundamentally wrong.
I tried different batch sizes, up to averaging over ALL x-values.
Basically the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network has to learn the parameters itself, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume the code itself is okay, as it worked for MNIST and it works for the "one x/y pair" example. I rather think my overall approach (see above) may not be suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, it's one hidden layer with two hidden neurons:
% helper: logistic sigmoid (not a built-in function in base MATLAB)
sigmoid = @(z) 1./(1+exp(-z));
% init hyper-parameters
hidden_neurons=2;
input_neurons=1;
output_neurons=1;
learning_rate=0.5;
batchsize=50;
% load data (d and v_start are this application's raw data vectors)
training_data=d(1:100)/100;
training_labels=v_start(1:100)/255;
% init weights
init_randomly=1;
if init_randomly
% initialize weights and bias with random numbers between -0.5 and +0.5
w1=rand(hidden_neurons,input_neurons)-0.5;
b1=rand(hidden_neurons,1)-0.5;
w2=rand(output_neurons,hidden_neurons)-0.5;
b2=rand(output_neurons,1)-0.5;
else
% initialize with manually determined values
w1=[10;-10];
b1=[-3;-0.5];
w2=[0.2 0.2];
b2=0;
end
for epochs =1:2000 % looping over some epochs
for i = 1:batchsize:length(training_data) % slice training data into batches
batch_data=training_data(i:min(i+batchsize-1,length(training_data))); % current training batch (batchsize elements)
batch_labels=training_labels(i:min(i+batchsize-1,length(training_data))); % matching label batch
% initialize weight updates for next batch
w2_update=0;
b2_update =0;
w1_update =0;
b1_update =0;
for k = 1: length(batch_data) % looping over one single batch
% extract training sample
x=batch_data(k); % extracting one single training sample
y=batch_labels(k); % extracting expected output of training sample
% forward pass
z1 = w1*x+b1; % sum of first layer
a1 = sigmoid(z1); % activation of first layer (sigmoid)
z2 = w2*a1+b2; % sum of second layer
a2=z2; %activation of second layer (linear)
% backward pass
delta_2=(a2-y); %calculating delta of second layer assuming quadratic cost; derivative of linear unit is equal to 1 for all x.
delta_1=(w2'*delta_2).* (a1.*(1-a1)); % calculating delta of first layer
% calculating the weight and bias updates averaging over one
% batch
w2_update = w2_update +(delta_2*a1') * (1/length(batch_data));
b2_update = b2_update + delta_2 * (1/length(batch_data));
w1_update = w1_update + (delta_1*x') * (1/length(batch_data));
b1_update = b1_update + delta_1 * (1/length(batch_data));
end
% actually updating the weights. Updated weights will be used in
% next batch
w2 = w2 - learning_rate * w2_update;
b2 = b2 - learning_rate * b2_update;
w1 = w1 - learning_rate * w1_update;
b1 = b1 - learning_rate * b1_update;
end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
[figure: training with random init]
One can argue that the blue line is already closer than the black one; in that sense the network has already improved the result. But I am not satisfied.
Here is the result with my manually tweaked values:
[figure: training with pre-init]
The black line is not bad for just two hidden neurons, but my expectation was rather that such a black line would be the outcome of training when starting from random initialization.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (not sure why)
With the same setup I tried to learn some 3rd-degree polynomials, which was successful (cost < 1e-6)
Randomizing training samples seems to improve learning (for the polynomial and my initial function). I know this is well known in the literature, but I had always skipped that part in my implementations. So I learned for myself how important it is.
For learning "curvy/wiggly" functions, I found sigmoid works better than ReLU. (The output layer is still "linear", as suggested for regression.)
A learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform
A larger batch size smooths the cost-vs-epochs plot (surprise...)
Initializing weights between -5 and +5 worked better than between -0.5 and +0.5 for my application
In the end I got quite convincing results for what I intended to learn with the network :)
Have you tried a much smaller learning rate? Generally, learning rates of 0.001 are a good starting point; 0.5 is in most cases way too large.
Also note that your predefined weights put the neurons in an extremely flat region of the sigmoid function (sigmoid(10) = 1, sigmoid(-10) = 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult. For exactly that reason, some people prefer ReLUs over sigmoids, since a ReLU has a "dead" region only for negative activations.
Also, am I correct in seeing that you only have 100 training samples? You could maybe try a smaller batch size, or increase the number of samples you take. Also don't forget to shuffle your samples after each epoch; plenty of reasons are given, for example here.
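A minimal sketch of how that shuffling could be wrapped around the existing batch loop (it reuses training_data, training_labels and batchsize from the code above; randperm draws a new order every epoch):
for epoch = 1:2000
    idx = randperm(length(training_data));         % new random order every epoch
    shuffled_data   = training_data(idx);
    shuffled_labels = training_labels(idx);
    for i = 1:batchsize:length(shuffled_data)      % then slice batches as before
        % ... forward/backward pass and weight updates as in the code above ...
    end
end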

Loss not decreasing: all sigmoid values close to 0 during training after a few iterations

I am training a network for detecting sentence similarity (paraphrase detection) using joint loss from LSTM layer and CNN layer. The final cost is simply the summation of individual loss (likelihood loss) from these two layers.
The probability that two sentences are similar is sigmoid(vec1^T * W * vec2 + b), where vec1 and vec2 are the vector representations of the two sentences, and W and b are the weights and bias learnt during training.
Final loss = Loss from LSTM layer + Loss from CNN layer.
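For reference, the similarity score above is just a bilinear form passed through a sigmoid; a minimal MATLAB sketch with placeholder dimensions and values:
d    = 200;                                        % placeholder vector dimension
vec1 = randn(d,1); vec2 = randn(d,1);              % sentence representations
W    = randn(d,d); b = 0;                          % learnt during training
p_similar = 1/(1 + exp(-(vec1' * W * vec2 + b)));  % sigmoid of the bilinear form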
When I train the system on a sample of 32 random sentences, my model converges well.
However, when using the complete data, the loss becomes stagnant and all sigmoid values go very close to 0.
My network parameters:
Gradient Descent Optimizer with learning rate 0.01 or 0.001.
Hidden State Dim 200.
Word Embedding Dim 300.
Gradient Clipping by norm to 5.
1 Layer of convolution with 200 kernels followed by 1 layer of max pool.
Can anyone give me a hint on what might be the issue with training on the complete data even though it works on the small dataset? Is there a vanishing-gradient problem?

Normalization in neural network with (x, y) output

I built a backpropagation neural network to learn from a dataset that consists of 7 continuous inputs and 2 outputs (x, y coordinates). My implementation choice was to use one hidden layer with 7 neurons, but I did it in such a way that I can try different combinations of hidden layers with a variable number of hidden nodes.
The error measurement is the usual mean squared error, calculated as follows:
MSE(x,y) = 1/N * sum((X - x)^2 + (Y - y)^2)
where X and Y are the target values and x and y the predictions. I also have to compute an accuracy measure, the mean Euclidean distance of each point from its target point; that is basically the same as the MSE, except the values inside the sum are square-rooted.
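In MATLAB-like notation, with X and Y the target vectors and x and y the prediction vectors over the dataset (a sketch, names are placeholders):
mse       = mean((X - x).^2 + (Y - y).^2);         % mean squared error
mean_dist = mean(sqrt((X - x).^2 + (Y - y).^2));   % mean Euclidean distance (accuracy measure)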
The input ranges are all between the interval [-2, +2], plus some outliers.
The output coordinates have completely unrelated distributions (x is normally distributed while y is uniformly distributed). The x range is small (say -1, +1 from the mean) while the y range varies more (say -10, +10 from the mean).
The behavior I get is that the net predicts the y output quite well, while the x output "flattens" towards y. I.e., the predicted x values drift closer to the y values; the network does not adapt to predict x correctly.
My initial choice was to scale both inputs and outputs as a whole to the usual (0, 1) interval, but that didn't lead to good results. So I then chose to standardize each input feature separately with its z-score, and to scale the outputs to the (0, 1) interval (I am using the sigmoid activation function, so (0, 1) seemed about right). But then this strange behavior appeared.
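A minimal sketch of that scheme, assuming the inputs are collected in an N-by-7 matrix X and the targets in an N-by-2 matrix T (both names are placeholders):
Xz   = (X - mean(X,1)) ./ std(X,0,1);              % z-score each input feature separately
Tmin = min(T,[],1); Tmax = max(T,[],1);
Ts   = (T - Tmin) ./ (Tmax - Tmin);                % scale each output to (0, 1)
% predictions are mapped back with: T_pred = Ts_pred .* (Tmax - Tmin) + Tmin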
So my questions are: how would you normalize such inputs/outputs? Is there a way to deal with such uncorrelated outputs? I had even thought about using two separate networks, each predicting a single output and discarding the other; is that a good choice?
Could you also point me to some reading where output normalization is discussed? The literature talks a lot about normalizing the inputs, but no one seems to care about the outputs.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically on the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of bias and transfer function. To invert, you'll just need to apply the inverse of each operation in the reverse order. I.e. take the output, apply the inverse transfer function, multiply it by the inverse of the last weight matrix, minus the bias, apply the inverse transfer function, multiply it by the inverse of the second to last weight matrix, and so on and so forth.
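A minimal sketch of that procedure for a two-layer network with leaky-ReLU activations and square, non-singular weight matrices (all values below are placeholders):
n  = 4;                                            % same width in every layer
W1 = randn(n); b1 = randn(n,1);
W2 = randn(n); b2 = randn(n,1);
leaky     = @(z) max(z, 0.1*z);                    % leaky ReLU
leaky_inv = @(a) min(a, a/0.1);                    % its exact inverse
x = randn(n,1);
y = leaky(W2*leaky(W1*x + b1) + b2);               % forward pass
h     = W2 \ (leaky_inv(y) - b2);                  % undo the last layer
x_rec = W1 \ (leaky_inv(h) - b1);                  % undo the first layer; x_rec matches x up to rounding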
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions you can invert layer by layer, subtract the bias and take the pseudoinverse (if you have the same number of neurons per every layer this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) != 0.
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) - b1 ), where W1p and W2p represent the pseudoinverse matrices of W1 and W2 respectively.
The following paper is a case study of inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of obtaining the x that yields a desired y would be to start with a random x (or a supplied seed), and then repeatedly adjust x through gradient descent until it yields a y close to the desired y. The algorithm is similar to backpropagation, except that instead of computing derivatives of the weights and biases, you compute derivatives of x; mini-batching is also not needed. This approach has the advantage that it allows a seed (a starting x) as input. I also have a hypothesis that the final x will have some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural-network application.
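A minimal sketch of that idea for the two-layer tanh network from the example above (the weights stand in for a trained network; learning rate, sizes and iteration count are arbitrary placeholders):
n  = 4;
W1 = randn(n); b1 = randn(n,1);                    % placeholder "trained" weights
W2 = randn(n); b2 = randn(n,1);
f  = @(x) tanh(W2*tanh(W1*x + b1) + b2);           % the learned forward model
y_star = f(randn(n,1));                            % a target output known to be reachable
x   = randn(n,1);                                  % random seed (or any chosen starting point)
eta = 0.05;
for it = 1:5000
    a1 = tanh(W1*x + b1);
    y  = tanh(W2*a1 + b2);
    d2 = (y - y_star) .* (1 - y.^2);               % gradient through the output tanh
    d1 = (W2'*d2)     .* (1 - a1.^2);              % gradient through the hidden tanh
    x  = x - eta * (W1'*d1);                       % adjust the input, not the weights
end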

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network made of 4 input neurons, 1 hidden layer made of 20 neurons and a 7 neuron output layer.
I'm trying to train it for a BCD-to-7-segment algorithm. My data is normalized: 0 is mapped to -1 and 1 to 1.
When the output error is evaluated, the neuron saturates at the wrong end. If the desired output is 1 and the real output is -1, the error is 1 - (-1) = 2.
When I multiply it by the derivative of the activation function, error*(1 - output)*(1 + output), the error becomes almost 0, because 2*(1 - (-1))*(1 + (-1)) = 0.
How can I avoid this saturation error?
Saturation at the asymptotes of the activation function is a common problem with neural networks. If you look at a graph of the function, it is not surprising: the curve is almost flat there, meaning that the first derivative is (almost) 0. The network cannot learn any more.
A simple solution is to scale the activation function to avoid this problem. For example, with tanh() activation function (my favorite), it is recommended to use the following activation function when the desired output is in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1 - tanh(2/3 * x)^2)
This will force the gradients into the most non-linear value range and speed up the learning. For all the details I recommend reading Yann LeCun's great paper Efficient BackProp.
In the case of tanh() activation function, the error would be calculated as
error = 2/3 * (1.7159 - output^2) * (teacher - output)
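As a small MATLAB sketch of the scaled activation, its derivative, and the resulting output-layer error term (constants as given above):
f      = @(x) 1.7159 * tanh(2/3 * x);              % scaled tanh activation
fprime = @(x) 1.14393 * (1 - tanh(2/3 * x).^2);    % its derivative (1.14393 = 1.7159 * 2/3)
% output-layer error term for a target value "teacher" at pre-activation x:
% delta = fprime(x) * (teacher - f(x));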
This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of two extremes. It's been a while since I have worked with Artificial Neural Networks but if I remember correctly, this (among many other things) is one of the limitations of using the simple back-propagation algorithm.
You could add a Momentum factor to make sure there is some correction based off previous experience, even when the derivative is zero.
You could also train it by epoch, where you accumulate the delta values for the weights before doing the actual update (compared to updating it every iteration). This also mitigates conditions where the delta values are oscillating between two values.
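A minimal sketch of such a momentum term applied to one weight matrix (w, w_gradient, learning_rate and alpha are placeholders):
alpha    = 0.9;                                    % momentum factor
velocity = zeros(size(w));                         % persists across updates
% inside the training loop, instead of w = w - learning_rate * w_gradient:
velocity = alpha * velocity - learning_rate * w_gradient;
w        = w + velocity;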
There may be more advanced methods, like second order methods for back propagation, that will mitigate this particular problem.
However, keep in mind that tanh only reaches -1 or +1 at infinity, so the problem is purely theoretical.
I'm not totally sure I am reading the question correctly, but if so, you should scale your inputs and targets between -0.9 and +0.9, which would help keep your derivatives sane.