I'm currently implementing my first NN, trained with a genetic algorithm and using a sigmoid activation function. It's going well, but I'm not sure what range the weights should lie in. I've searched around with no luck. How does one choose the range of the weights in a NN? What does it depend on?
The weights can be seen as an intrinsic property of the problem you're trying to solve using the GA/NN approach; there's no general best value for these, so you're best off studying different weight spans (w.r.t. your training sets) with the other parameters fixed.
E.g., study different settings of the parameter weightSpan in
weights ∈ [-weightSpan/2, weightSpan/2],
and let your initial chromosomes describe weights with randomized values in this range. Your squashing function (the sigmoid) then maps the NN response into the range [0, 1].
Finding an appropriate weight span is, much like choosing the number of hidden layers, a process of problem-specific testing ("there is no free lunch").
Edit:
I thought I'd add that the easiest way to study different weight spans is probably to fix the weight span, say to [-1, 1], and instead study the squashing constant in your squashing function (the sigmoid). I.e., study different (non-negative) values of the constant c in your sigmoid
σ(s) = 1 / (1 + e^(-c*s))
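To make this concrete, here is a minimal Python sketch (mine, not from the answer above) of initializing GA chromosomes within a fixed weight span and sweeping the steepness constant c of the sigmoid; the network size, population size, and values of c are purely illustrative:

import numpy as np

def sigmoid(s, c=1.0):
    # Squashing function: maps any activation s into (0, 1); larger c gives a steeper curve.
    return 1.0 / (1.0 + np.exp(-c * s))

# Illustrative GA initialization: each chromosome encodes all weights of the net,
# drawn uniformly from a fixed span such as [-1, 1].
rng = np.random.default_rng(0)
weight_span = 2.0                      # i.e. weights in [-1, 1]
n_weights = 6 * 4 + 4 * 1              # e.g. a small 6-4-1 fully connected net (biases omitted)
population = rng.uniform(-weight_span / 2, weight_span / 2, size=(50, n_weights))

# Instead of changing weight_span, keep it fixed and sweep the steepness c:
for c in (0.5, 1.0, 2.0, 4.0):
    print(c, sigmoid(population[0, :5].sum(), c=c))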
Related
I just wanted to test how well a neural network can approximate the multiplication function (a regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer
(I have tested 5/30/100 neurons per hidden layer), no normalization, and the default parameters:
Learning rate - 0.005, Number of learning iterations - 200, The initial learning weights - 0.1,
The momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time boosted Decision forest regression shows very good approximation.
What am I doing wrong? This task should be very easy for NN.
The large gradients of the multiplication function probably force the net almost immediately into some horrifying state where all its hidden nodes have a zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before training and multiply back afterwards.
2) Make a log-normalization. It turns multiplication into addition (a quick check is sketched right after this):
m = x*y => ln(m) = ln(x) + ln(y).
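As a quick illustration (not part of the answer itself), a few lines of numpy confirm that after log-normalization the target is exactly linear in the transformed inputs, which is what makes the problem easy for a network:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 10.0, size=(2, 1000))   # strictly positive inputs
m = x * y

# Log-normalize: ln(m) = ln(x) + ln(y), so the target is linear in the transformed inputs.
X = np.column_stack([np.log(x), np.log(y), np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, np.log(m), rcond=None)
print(coef)                        # ≈ [1, 1, 0]: a plain linear model (or a tiny NN) recovers it

# Predictions are mapped back with exp():
pred = np.exp(X @ coef)
print(np.max(np.abs(pred - m)))    # ≈ 0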
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved, or are they blowing up? One common source of problems is a learning rate set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough). A minimal sketch of this last check follows below.
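Not from the original answer, but as a rough sketch of that check (assuming Keras, which appears later in this thread; all layer sizes and values are illustrative), one can compare the loss curves for two learning rates:

import numpy as np
from keras import layers, models, optimizers

# Minimal sketch: compare loss curves for two learning rates to spot unstable training.
rng = np.random.default_rng(0)
data = rng.random((2000, 2))
targets = data[:, 0] * data[:, 1]

def train(lr):
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(2,)),
        layers.Dense(1, activation='linear'),   # linear output for regression
    ])
    model.compile(optimizer=optimizers.SGD(learning_rate=lr), loss='mse')
    return model.fit(data, targets, epochs=20, batch_size=32, verbose=0).history['loss']

print(train(0.01))   # should decrease smoothly
print(train(10.0))   # typically blows up (loss grows or becomes NaN)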
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models

# Two random inputs in [0, 1); the target is their product.
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])

model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))   # outputs are non-negative here, so relu is fine

model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)

# predict expects an array, so wrap the sample in np.array
print(model.predict(np.array([[0.8, 0.5]])))
It works.
"Two approaches: divide by constant, or make log normalization"
I've tried both approaches. Log normalization certainly works since, as you rightly point out, it forces an implementation of addition. Dividing by a constant -- or, similarly, normalizing across any range -- did not succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network that will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation of a neural network's ability to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely within a certain range of numbers. This is the gist link.
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0
Is it better for a neural network to use a smaller range of training data, or does it not matter? For example, if I want to train an ANN with angles (float values), should I pass those values in degrees [0, 360], in radians [0, 6.28], or should all values be normalized to the range [0, 1]? Does the range of the training data affect ANN learning quality?
My neural network has 6 input neurons and 1 hidden layer, and I am using a sigmoid symmetric activation function (tanh).
For the neural network it doesn't matter whether the data is normalised.
However, the performance of the training method can vary a lot.
In a nutshell: the training methods typically favour variables with larger values. This might send the training method off-track.
Crucial for most NN training methods is that all dimensions of the training data have the same domain. If all your variables are angles, it doesn't matter whether they are in [0,1), [0,2*pi) or [0,360), as long as they all share the same domain. However, you should avoid having one variable for the angle in [0,2*pi) and another variable for a distance in mm, where the distance can be much larger than 2,000,000 mm.
Two cases where an algorithm might suffer from this:
(a) regularisation: if the weights of the NN are forced to be small, a tiny change in a weight that controls an input with a large domain has a much larger impact than one for a small domain;
(b) gradient descent: if the step size is limited, you get similar effects.
Recommendation: all variables should have the same domain size; whether it is [0,1] or [0,2*pi] or something else doesn't matter.
Addendum: for many domains, "z-score normalisation" works extremely well.
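For concreteness, here is a tiny numpy sketch of z-score normalisation (variable names and values are mine) that puts an angle and a millimetre-scale distance on the same footing:

import numpy as np

# Illustrative z-score normalisation: every column ends up with mean 0 and std 1,
# so an angle in [0, 2*pi) and a distance in mm get comparable influence.
rng = np.random.default_rng(0)
angles_rad = rng.uniform(0, 2 * np.pi, 1000)
distances_mm = rng.uniform(0, 2_000_000, 1000)
X = np.column_stack([angles_rad, distances_mm])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std          # reuse the same mean/std later for test data

print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))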
The range of the data points affects the way you train a model. Suppose the range of values of the features in the data set is not normalized. Then, depending on your data, you may end up with elongated ellipses of data points in the feature space, and the learning model will have a very hard time learning the manifold on which the data points lie (i.e. learning the underlying distribution). Also, in most cases the data points are sparsely spread in the feature space if not normalized (see this). So the take-home message is to normalize the features whenever possible.
I'm running a series of SVM classifiers for a binary classification problem, and am getting very nice results as far as classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
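For reference, here is the standard primal soft-margin linear SVM in textbook notation (my notation, not copied from MATLAB's documentation):

\min_{w,\,b,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,

so the decision function is f(x) = w^\top x + b, with Beta playing the role of w and b the offset.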
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?
Intuitively, one can think of the absolute value of a feature's weight as a measure of its importance. However, this is not true in the general case, because the weights express how much a marginal change in the feature value would affect the output, which means they depend on the feature's scale. For instance, if we have a feature for "age" measured in years and then change it to months, the corresponding coefficient will be divided by 12, but clearly that doesn't mean age has become less important!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled your intuition is correct and in fact, there is a feature selection method that does just that: choosing the features with the highest absolute weight. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this is correct only for a linear SVM.
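The thread is about MATLAB's fitcsvm, but as a rough sketch of the same recipe in Python/scikit-learn (synthetic data, all names from sklearn; this is not the MATLAB API):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scale the features, fit a linear SVM, then rank features by the absolute value of their weights.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)          # scaling makes |weight| comparable across features

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
weights = clf.coef_.ravel()                    # analogue of SVMModel.Beta for a linear SVM

ranking = np.argsort(-np.abs(weights))         # most "important" features first
print(list(zip(ranking, weights[ranking].round(3))))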
I have a big data set (time series, about 50 parameters/values). I want to use a Kohonen network to group similar data rows. I've read a bit about Kohonen neural networks and I understand the idea, but:
I don't know how to implement a Kohonen network with so many dimensions. I found an example on CodeProject, but only with a 2- or 3-dimensional input vector. When I have 50 parameters, should I create 50 weights in each of my neurons?
I don't know how to update the weights of the winning neuron (how do I calculate the new weights?).
My English is not perfect and I don't understand everything I read about Kohonen networks, especially the descriptions of the variables in the formulas; that's why I'm asking.
One should distinguish the dimensionality of the map, which is usually low (e.g. 2 in the common case of a rectangular grid), from the dimensionality of the reference vectors, which can be arbitrarily high without problems.
Look at http://www.psychology.mcmaster.ca/4i03/demos/competitive-demo.html for a nice example with 49-dimensional input vectors (7x7 pixel images). The Kohonen map in this case has the form of a one-dimensional ring of 8 units.
See also http://www.demogng.de for a Java simulator of various Kohonen-like networks, including ring-shaped ones like the one at McMaster. The reference vectors there are all 2-dimensional, but only for easier display; they could have arbitrarily high dimensions without any change to the algorithms.
Yes, you would need 50 weights per neuron (one per input dimension). However, the maps themselves are usually low-dimensional, as described in this self-organizing map article; I have never seen them use more than a few map dimensions.
You have to use an update formula. From the same article: Wv(s + 1) = Wv(s) + Θ(u, v, s) α(s)(D(t) - Wv(s))
Yes, you'll need 50 inputs (and hence 50 weights) for each neuron.
You basically do a linear interpolation between each neuron and the target (input) vector, using W(s + 1) = W(s) + Θ() * α(s) * (Input(t) - W(s)), with Θ being your neighbourhood function.
And you should update all your neurons, not only the winner.
Which function you use as the neighbourhood function depends on your actual problem.
A common property of such a function is that it has the value 1 when i = k and falls off with the Euclidean distance between the units; additionally, it shrinks over time (in order to localize clusters).
Simple neighbourhood functions include linear interpolation (up to a "maximum distance") or a Gaussian function, as in the sketch below.
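Putting the pieces together, here is a small numpy sketch (variable names and constants are mine) of a single SOM update step with 50-dimensional reference vectors and a Gaussian neighbourhood:

import numpy as np

# Sketch of one SOM update step: 50-dimensional input, 8x8 map.
rng = np.random.default_rng(0)
map_h, map_w, dim = 8, 8, 50
W = rng.random((map_h, map_w, dim))             # each unit holds a 50-dim reference vector
x = rng.random(dim)                             # one input row with 50 parameters

# 1) Find the winner (best matching unit): smallest Euclidean distance to x.
dists = np.linalg.norm(W - x, axis=2)
wi, wj = np.unravel_index(np.argmin(dists), dists.shape)

# 2) Update ALL units, weighted by a Gaussian neighbourhood around the winner.
alpha, sigma = 0.5, 2.0                         # both should shrink over training time
ii, jj = np.meshgrid(np.arange(map_h), np.arange(map_w), indexing='ij')
grid_dist2 = (ii - wi) ** 2 + (jj - wj) ** 2
theta = np.exp(-grid_dist2 / (2 * sigma ** 2))  # 1 at the winner, falls off with grid distance
W += alpha * theta[:, :, None] * (x - W)        # W(s+1) = W(s) + alpha * theta * (x - W(s))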
I was wondering, MATLAB has a removeconstantrows function that should be applied to feedforward neural network input and target output data. This function removes constant rows from the data. For example if one input vector for a 5-input neural network is [1 1 1 1 1] then it is removed.
Googling, the best explanation I could find is that (paraphrasing) "constant rows are not needed and can be replaced by appropriate adjustments to the biases of the output layer".
Can someone elaborate?
Who does this adjustment?
From my book, the weight adjustment for simple gradient descent is:
Δ weight_i = learning_rate * local_gradient * input_i
Which means that all weights of a neuron at the first hidden layer are adjusted the same amount. But they ARE adjusted.
I think there is a misunderstanding. The "row" is not an input pattern but a feature, i.e. the i-th component across all patterns. Obviously, if some feature does not have much variance over the whole data set, it does not provide valuable information and does not play a noticeable role in network training.
The comparison to a bias is reasonable (though I don't agree that this applies only to the output layer, because it depends on where the constant row is found - if it's in the input data, then the same holds for the first hidden layer, imho). If you remember, it's recommended that each neuron in a backpropagation network have a special bias weight, connected to a constant signal of 1. If, for example, a training set contains a row of all 1s, then this is the same as an additional bias. If the constant row has a different value, then the bias will have a different effect, but in any case you can simply eliminate this row and fold the constant value of the row into the existing bias.
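A tiny numpy check of that equivalence (illustrative numbers only): a constant input feature with value c and incoming weight w contributes w*c to every neuron's pre-activation, which is exactly an extra bias term:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))
X_with_const = np.column_stack([X, np.full(100, 3.0)])     # constant row/feature = 3.0

W = rng.random((6, 4))            # 6 inputs (incl. the constant) -> 4 hidden neurons
b = rng.random(4)

out_full = X_with_const @ W + b                            # network with the constant feature
out_folded = X @ W[:5] + (b + 3.0 * W[5])                  # feature removed, folded into the bias
print(np.allclose(out_full, out_folded))                   # True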
Disclaimer: I'm not a Matlab user. My background in neural networks comes purely from the programming side.