I am learning to build neural networks for regression problems. It works well approximating linear functions. Setup with 1-5–1 units with linear activation functions in hidden and output layers does the trick and results are fast and reliable. However, when I try to feed it simple quadratic data (f(x) = x*x) here is what happens:
With linear activation function, it tries to fit a linear function through dataset
And with TANH function it tries to fit a a TANH curve through the dataset.
This makes me believe that the current setup is inherently unable to learn anything but a linear relation, since it's repeating the shape of activation function on the chart. But this may not be true because I've seen other implementations learn curves just perfectly. So I may be doing something wrong. Please provide your guidance.
About my code
My weights are randomized (-1, 1) inputs are not normalized. Dataset is fed in random order. Changing learning rate or adding layers, does not change the picture much.
I've created a jsfiddle,
the place to play with is this function:
function trainingSample(n) {
return [[n], [n]];
}
It produces a single training sample: an array of an input vector array and a target vector array.
In this example it produces an f(x)=x function. Modify it to be [[n], [n*n]] and you've got a quadratic function.
The play button is at the upper right, and there also are two input boxes to manually input these values. If target (right) box is left empty, you can test the output of the network by feedforward only.
There is also a configuration file for the network in the code, where you can set learning rate and other things. (Search for var Config)
It's occurred to me that in the setup I am describing, it is impossible to learn non–linear functions, because of the choice of features. Nowhere in forward pass we have input dependency of power higher than 1, that's why I am seeing a snapshot of my activation function in the output. Duh.
Related
I have been experimenting with neural networks these days. I have come across a general question regarding the activation function to use. This might be a well known fact to but I couldn't understand properly. A lot of the examples and papers I have seen are working on classification problems and they either use sigmoid (in binary case) or softmax (in multi-class case) as the activation function in the out put layer and it makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is that is it by choice we don't use any activation function in the output layer of a regression model as we don't want the activation function to limit or put restrictions on the value. The output value can be any number and as big as thousands so the activation function like sigmoid to tanh won't make sense. Or is there any other reason? Or we actually can use some activation function which are made for these kind of problems?
for linear regression type of problem, you can simply create the Output layer without any activation function as we are interested in numerical values without any transformation.
more info :
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
for classification :
You can use sigmoid, tanh, Softmax etc.
If you have, say, a Sigmoid as an activation function in output layer of your NN you will never get any value less than 0 and greater than 1.
Basically if the data your're trying to predict are distributed within that range you might approach with a Sigmoid function and test if your prediction performs well on your training set.
Even more general, when predict a data you should come up with the function that represents your data in the most effective way.
Hence if your real data does not fit Sigmoid function well you have to think of any other function (e.g. some polynomial function, or periodic function or any other or a combination of them) but you also should always care of how easily you will build your cost function and evaluate derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.
I'm trying to get a better understanding of neural networks by trying to programm a Convolution Neural Network by myself.
So far, I'm going to make it pretty simple by not using max-pooling and using simple ReLu-activation. I'm aware of the disadvantages of this setup, but the point is not making the best image detector in the world.
Now, I'm stuck understanding the details of the error calculation, propagating it back and how it interplays with the used activation-function for calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum-function doesn't have defined start- and ending points, so i basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (i assume he means E_total?) and gives an example for how to define the new weight:
where W is the weights of a particular layer.
This confuses me, as i always stood under the impression the activation-function (ReLu in my case) played a role in how to calculate the new weight. Also, this seems to imply i simply use the error for all layers. Doesn't the error value i propagate back into the next layer somehow depends on what i calculated in the previous one?
Maybe all of this is just uncomplete and you can point me into the direction that helps me best for my case.
Thanks in advance.
You do not backpropagate errors, but gradients. The activation function plays a role in caculating the new weight, depending on whether or not the weight in question is before or after said activation, and whether or not it is connected. If a weight w is after your non-linearity layer f, then the gradient dL/dw wont depend on f. But if w is before f, then, if they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw=(dL/df)*df/dw //notations might change according to the shape
//of the tensors/matrices/vectors you chose, but
//this is just the chain rule
As for your cost function, it is correct. Many people write these formulas in this non-formal style so that you get the idea, but that you can adapt it to your own tensor shapes. By the way, this sort of MSE function is better suited to continous label spaces. You might want to use softmax or an svm loss for image classification (I'll come back to that). Anyway, as you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2d pixel x_i and predicts a 3d vector v_i for that pixel. Now, in your training data, x_i will already have a ground truth 3d vector (i.e label), that we'll call y_i. Then, your cost function will be (the index i runs on all data samples):
sum_i{(y_i-v_i)^t (y_i-vi)}=sum_i{||y_i-v_i||^2}
But as I said, this cost function works if the labels form a continuous space (here , R^3). This is also called a regression problem.
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss, the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set, for each data point x_i, you have a label c_i that indicates the correct class. Now, your neural network should produce scores for each possible label, that we'll note s_1,..,s_n. Let's note the score of the correct class of a training sample x_i as s_{c_i}. Now, if we use a softmax function, the intuition is to transform the scores into a probability distribution, and maximise the probability of the correct classes. That is , we maximse the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j))}
where i runs over all training samples, and j=1,..n on all class labels.
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (essentially the Andrew Karpathy parts at least).
I am trying to make a simple radial basis function network (RBFN) for regression. I have a 20 dimensional (feature) dataset with over 600 samples. I need the final network to output 1 scalar value for each 20 dimensional sample.
Note: new to machine learning...and feel like I am missing an important concept here.
With the perceptron we can, and I have, trained a linear network until the prediction error is at a minimum using a small subset of the initial samples.
Is there a similar process with the RBFN?
Yes there is,
The main two differences between a multi-layer perceptron and a RBFN are the fact that a RBFN usually implies just one layer and that the activation function is a gaussian instead of a sigmoid.
The training phase can be done using gradient descend of the error loss function, so it is relatively simple to implement.
Keep in mind that RBFN is a linear combination of RBF units, so the range of the output is limited and you would need to transform it if you need an scalar outside of that range.
There is a few of resources that you could consult as reference:
[PDF] (http://scholar.lib.vt.edu/theses/available/etd-6197-223641/unrestricted/Ch3.pdf)
[Wikipedia] (http://en.wikipedia.org/wiki/Radial_basis_function_network)
[Wolfram] (http://reference.wolfram.com/applications/neuralnetworks/NeuralNetworkTheory/2.5.2.html)
Hope it helps,
I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = #(z)crossval('mcr',cdata,grp,'Predfun', ...
#(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I will appreciate if you can give me the code for this section. Thanks.
I will like to know what does it mean when it says exp([rbf_sigma,boxconstraint])
rbf_sigma: The svm is using a gaussian kernel, the rbf_sigma set the standard deviation (~size) of the kernel. To understand how kernels work, the SVM is putting the kernel around every sample (so that you have a gaussian around every sample). Then the kernels are added up (sumed) for the samples of each category/type. At each point the type which sum is higher would be the "winner". For example if type A has a higher sum of these kernels at point X, then if you have a new datum to classify in point X, it will be classified as type A. (there are other configuration parameters that may change the actual threshold where a category is selected over another)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: it is a cost/penalty over miss-classified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm is using an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a miss-classified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advice to try values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advice this to try a wide range of values first. You can further optimize later FROM the first optimum that you obtained before.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of #minfn, then fminsearch will take 'exponentially' big steps.
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps
I just started playing around with neural networks and, as I would expect, in order to train a neural network effectively there must be some relation between the function to approximate and activation function.
For instance, I had good results using sin(x) as an activation function when approximating cos(x), or two tanh(x) to approximate a gaussian. Now, to approximate a function about which I know nothing I am planning to use a cocktail of activation functions, for instance a hidden layer with some sin, some tanh and a logistic function. In your opinion does this make sens?
Thank you,
Tunnuz
While it is true that different activation functions have different merits (mainly for either biological plausibility or a unique network design like radial basis function networks), in general you be able to use any continuous squashing function and expect to be able to approximate most functions encountered in real world training data.
The two most popular choices are the hyperbolic tangent and the logistic function, since they both have easily calculable derivatives and interesting behavior around the axis.
If neither if those allows you to accurately approximate your function, my first response wouldn't be to change activation functions. Rather, you should first investigate your training set and network training parameters (learning rates, number of units in each pool, weight decay, momentum, etc.).
If your still stuck, step back and make sure your using the right architecture (feed forward vs. simple recurrent vs. full recurrent) and learning algorithm (back-propagation vs. back-prop through time vs. contrastive hebbian vs. evolutionary/global methods).
One side note: Make sure you never use a linear activation function (except for output layers or crazy simple tasks), as these have very well documented limitations, namely the need for linear separability.