There are three common image data normalization methods:
1. X = (X - X.mean()) / X.std()
2. X /= 255. # based on the formula (X - min) / (max - min), which scales the data into [0, 1]
3. X = 2 * (X - min) / (max - min) - 1 # scales the data into [-1, 1]
I've found that in different CNN tutorials and posts people may use any one of them to normalize the data, but I am a bit confused: how should I choose between them in different situations?
Thanks for any explanations in advance.
Broadly speaking, the reason we normalize the images is to make the model converge faster. When the data is not normalized, the shared weights of the network have different calibrations for different features, which can make the cost function converge very slowly and ineffectively. Normalizing the data makes the cost function much easier to train.
Exactly which normalization method you choose depends on the data you are dealing with and the assumptions you make about that data. All three normalization methods above are based on two ideas: centering and scaling. Method 2 involves only scaling the data into a particular range; this makes sure that the various features are on a similar scale and hence gives stable gradients. Method 1 involves centering the data around the mean and then dividing each dimension by its standard deviation, so that all dimensions hold equal importance for the learning algorithm. This normalization is more effective when you have reason to believe that different dimensions in the data have vastly different ranges; bringing all dimensions into the same range makes sharing of the parameters effective. Method 3 can be seen as doing somewhat the same job as method 1.
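For concreteness, here is a small numpy illustration of the three methods (assuming X is an 8-bit image array cast to float; the array shape is arbitrary):

import numpy as np

X = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)

# 1. Standardization: centering + scaling to zero mean, unit variance
X1 = (X - X.mean()) / X.std()

# 2. Min-max scaling into [0, 1]; for 8-bit images min = 0 and max = 255, so X / 255.
X2 = X / 255.

# 3. Min-max scaling into [-1, 1]
X3 = 2 * (X - X.min()) / (X.max() - X.min()) - 1

print(X1.mean(), X1.std())   # ~0 and ~1
print(X2.min(), X2.max())    # within [0, 1]
print(X3.min(), X3.max())    # within [-1, 1]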
I have a few questions regarding the theory behind neural networks' gradient descent.
First question: let's say we have 5 weights, one for each of the 5 features, and now we want to compute the gradient. How does the algorithm do this internally? Does it take the first weight (W1), try increasing it a bit (or decreasing it) and, when it is done, move on to the 2nd weight? Or does it do it differently and more efficiently, changing more than one weight simultaneously?
Second question: if feature 1 is far more important than feature 2, so that the same change (in %) of W1 has a bigger effect on the loss than the same change of W2, isn't it better to have a different learning rate for each weight? If we have only one learning rate, do we set it by taking into account only the most impactful weight?
For question 1:
It just does gradient descent. You don't wiggle weights independently: you stack your weights in a vector/matrix/tensor W and compute an increment delta_W, which is itself a (respectively) vector/matrix/tensor. Once you know this increment, you apply it to all weights at once.
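For intuition, here is a tiny self-contained sketch (my own illustration, not from the answer) of one such vectorized update loop for a linear model with 5 weights:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 samples, 5 features
y = X @ np.array([1., 2., 3., 4., 5.])    # ground-truth weights

W = np.zeros(5)
lr = 0.01
for _ in range(1000):
    pred = X @ W
    delta_W = -lr * 2 * X.T @ (pred - y) / len(y)  # gradient of MSE w.r.t. all 5 weights at once
    W += delta_W                                   # one simultaneous update of the whole vector
print(W)                                           # close to [1, 2, 3, 4, 5]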
For question 2:
There are already many algorithms that tune the learning rate per parameter; see for example RMSprop and Adam. Roughly speaking, those are based on the frequency at which a parameter intervenes.
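As a rough illustration of the idea, here is a simplified RMSprop-style update (my own sketch with made-up gradients, not the library implementation): each weight's step is scaled down when that weight has had large recent gradients.

import numpy as np

def rmsprop_step(W, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2   # running average of squared gradients
    W = W - lr * grad / (np.sqrt(cache) + eps)        # larger cache -> smaller effective step
    return W, cache

W = np.zeros(5)
cache = np.zeros(5)
grad = np.array([10.0, 0.1, 1.0, -5.0, 0.01])         # made-up gradients for the 5 weights
W, cache = rmsprop_step(W, grad, cache)
print(W)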
Regarding the "importance" that you describe:
so the same change (in %) of W1 has a bigger effect on loss compared to W2, isn't it better to have a different learning rate for each weight
You are just describing the gradient! In that case W1 has a larger gradient than W2, and it is already being updated with a larger step, so to speak. It wouldn't make much sense, though, to play around with its learning rate independently unless you have more information about its role (e.g. the frequency mentioned above).
I am using Caffe, specifically pycaffe, to create my neural network. I noticed that I have to use a BatchNorm layer to get a positive result. I am using the Kappa score as my evaluation metric.
I have now seen several different locations for the BatchNorm layers in my network. But I also came across the Scale layer, which is not in the Layer Catalogue but is often mentioned together with the BatchNorm layer.
Do you always need to put a Scale layer after a BatchNorm layer, and what does it do?
From the original batch normalization paper by Ioffe & Szegedy: "we make sure that the transformation inserted in the network can represent the identity transform." Without the Scale layer after the BatchNorm layer, that would not be the case because the Caffe BatchNorm layer has no learnable parameters.
I learned this from the Deep Residual Networks git repo; see item 6 under disclaimers and known issues there.
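For reference, a rough pycaffe NetSpec sketch of the usual pairing could look like the following (the helper name add_bn_scale and the bottom blob are my own placeholders, not from the repo):

from caffe import layers as L

def add_bn_scale(n, bottom):
    # BatchNorm in Caffe normalizes to zero mean / unit variance but has no
    # learnable gamma/beta, so a Scale layer with bias_term=True follows it to
    # restore the ability to represent the identity transform.
    n.bn = L.BatchNorm(bottom, in_place=True)
    n.scale = L.Scale(n.bn, bias_term=True, in_place=True)
    return n.scale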
In general, you will get no benefit from a scale layer juxtaposed with batch normalization. Each is a linear transformation. Where BatchNorm shifts and rescales so that the new distribution has a mean of 0 and variance of 1, Scale compresses the entire range into a specified interval, typically [0, 1]. Since they're both linear transformations, if you apply them in sequence, the second will entirely undo the work of the first.
They also deal somewhat differently with outliers. Consider a set of data: ten values, five each of -1 and +1. BatchNorm will not change this at all: it already has mean 0 and variance 1. For consistency, let's specify the same interval for Scale, [-1, 1], which is also a popular choice.
Now, add an outlier of, say, 99 to the mix. Scale will transform the set to the range [-1, 1], so that there are now five -1.00 values, one +1.00 value (the former 99), and five values of -0.96 (formerly +1).
BatchNorm cares about the mean and standard deviation, not the max and min values. The new mean is +9; the S.D. is 28.48 (rounding everything to 2 decimal places). The numbers will be scaled to roughly five values each of -0.35 and -0.28, and one value of 3.16.
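You can reproduce these numbers with a quick numpy check (my own illustration of the two transformations):

import numpy as np

x = np.array([-1.0] * 5 + [1.0] * 5 + [99.0])

minmax = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # five -1.00, five -0.96, one 1.00
standardized = (x - x.mean()) / x.std()                # roughly five -0.35, five -0.28, one 3.16

print(np.round(minmax, 2))
print(np.round(standardized, 2))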
Whether one scaling works better than the other depends much on the skew and scatter of your distribution. I prefer BatchNorm, as it tends to differentiate better in dense regions of a distribution.
I just wanted to test how well a neural network can approximate the multiplication function (a regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer
(I have tested 5/30/100 neurons per hidden layer), no normalization, and the default parameters:
learning rate - 0.005, number of learning iterations - 200, the initial learning weights - 0.1,
the momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time, boosted decision forest regression shows a very good approximation.
What am I doing wrong? This task should be very easy for NN.
The large gradient of the multiplication function probably forces the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before learning and multiply back afterwards.
2) Use log-normalization, which turns multiplication into addition (see the sketch after this list):
m = x*y => ln(m) = ln(x) + ln(y).
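As a sketch of approach 2 in Keras (my own example; the layer sizes, optimizer and sample counts are arbitrary), note that in log space even a single linear unit can represent the target exactly, since ln(m) = 1*ln(x) + 1*ln(y) + 0:

import numpy as np
from keras import layers
from keras import models

x = np.random.uniform(1.0, 100.0, size=(10000, 2))     # keep inputs > 0 for the log
y = x[:, 0] * x[:, 1]

model = models.Sequential()
model.add(layers.Dense(1, input_shape=(2,)))            # linear output, no hidden layer needed
model.compile(optimizer='adam', loss='mse')
model.fit(np.log(x), np.log(y), epochs=50, batch_size=32, verbose=0)

pred = np.exp(model.predict(np.log(np.array([[13.0, 7.0]]))))
print(pred)                                             # should approach 13 * 7 = 91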
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1).
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values, and if you pass a large number as input you can get saturation, which loses information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
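For the last point, here is a rough self-contained sketch of one way to monitor the training loss in Keras (the architecture and hyperparameters are arbitrary, not a recommendation):

import numpy as np
import matplotlib.pyplot as plt
from keras import layers
from keras import models

data = np.random.random((1000, 2))
results = data[:, 0] * data[:, 1]

model = models.Sequential()
model.add(layers.Dense(30, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1))                         # linear output for regression
model.compile(optimizer='sgd', loss='mse')         # squared error, as suggested above

history = model.fit(data, results, epochs=50, batch_size=32, verbose=0)
plt.plot(history.history['loss'])                  # unstable curve -> learning rate too high; flat -> too low
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.show()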
This is how I do multiplication with a neural network:
import numpy as np
from keras import layers
from keras import models

# Two inputs (the factors), one ReLU hidden layer, one ReLU output.
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))

# Training data: random pairs in [0, 1) and their products.
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])

model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)

model.predict(np.array([[0.8, 0.5]]))  # should be close to 0.4
It works.
"Two approaches: divide by constant, or make log normalization"
I tried both approaches. Certainly, log-normalization works, since, as you rightly point out, it forces an implementation of addition. Dividing by a constant -- or similarly normalizing across any range -- seems not to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network that will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely within a certain range of numbers. This is the gist link
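Since the gist itself isn't reproduced here, a comparable minimal sketch (my own, not the author's gist) could look like this:

import numpy as np
from keras import layers
from keras import models

x = np.random.uniform(-1.0, 1.0, size=(20000, 2))                    # bounded inputs
y = x[:, 0] * x[:, 1]

model = models.Sequential()
model.add(layers.Dense(64, activation='sigmoid', input_shape=(2,)))  # one sigmoidal hidden layer
model.add(layers.Dense(1))                                           # linear output for regression
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=30, batch_size=64, verbose=0)

print(model.predict(np.array([[0.5, -0.4]])))                        # roughly -0.2 (= 0.5 * -0.4)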
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0
So I am currently trying to implement my first NN, with a genetic algorithm for training and a sigmoid activation function. It's all good, but I'm not quite sure what range the weights should lie in. I've searched around for an answer but with no luck. How does one choose the range of the weights in a NN? What does it depend on?
The weights can be seen as an intrinsic property of the problem you're trying to solve using the GA/NN approach; there's no general best value for these, so you're best off studying different weight spans (w.r.t. your training sets) with the other parameters fixed.
E.g., study different settings for parameter weightSpan in
weights ∈ [-weightSpan/2, weightSpan/2],
and let your initial chromosomes describe weights with randomized values in this range. Your squashing function (sigmoid) is then used to map the NN response to the range [0, 1].
Finding an appropriate weight span is, much like choosing the number of hidden layers, a process of problem-specific testing ("there is no free lunch").
Edit:
I thought I'd add that the easiest way to study different weight spans is probably to set a fixed weight span, say [-1, 1], and study the squashing constant in your squashing function (sigmoid). I.e., study different (non-negative) values of constant c in your sigmoid
σ(s) = 1 / (1 + e^(-c*s))
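As a small illustration (the names weight_span and c here are mine, not from the answer): fix the weight span to [-1, 1] and vary the squashing constant c.

import numpy as np

def sigmoid(s, c=1.0):
    # squashing function sigma(s) = 1 / (1 + exp(-c * s))
    return 1.0 / (1.0 + np.exp(-c * s))

rng = np.random.default_rng(0)
weight_span = 2.0                                             # weights in [-1, 1]
weights = rng.uniform(-weight_span / 2, weight_span / 2, 5)   # one random chromosome (5 weights)
inputs = rng.uniform(0.0, 1.0, 5)

s = weights @ inputs                                          # net input to one neuron
for c in (0.5, 1.0, 5.0):                                     # larger c -> steeper squashing
    print(c, sigmoid(s, c))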
I have a big data set (time series, about 50 parameters/values per row). I want to use a Kohonen network to group similar data rows. I've read a bit about Kohonen neural networks and I understand the idea of a Kohonen network, but:
I don't know how to implement a Kohonen network with so many dimensions. I found an example on CodeProject, but only with a 2- or 3-dimensional input vector. When I have 50 parameters, should I create 50 weights in my neurons?
I don't know how to update the weights of the winning neuron (how do I calculate the new weights?).
My English is not perfect and I don't understand everything I read about Kohonen networks, especially the descriptions of the variables in the formulas; that's why I'm asking.
One should distinguish the dimensionality of the map, which is usually low (e.g. 2 in the common case of a rectangular grid), from the dimensionality of the reference vectors, which can be arbitrarily high without problems.
Look at http://www.psychology.mcmaster.ca/4i03/demos/competitive-demo.html for a nice example with 49-dimensional input vectors (7x7 pixel images). The Kohonen map in this case has the form of a one-dimensional ring of 8 units.
See also http://www.demogng.de for a Java simulator for various Kohonen-like networks, including ring-shaped ones like the one at McMaster. The reference vectors there are all 2-dimensional, but only for easier display; they could have arbitrarily high dimensions without any change in the algorithms.
Yes, each neuron would need 50 weights (one per input). However, the maps themselves are usually low dimensional, as described in this self-organizing map article; I have never seen a map with more than a few dimensions.
You have to use an update formula. From the same article: Wv(s + 1) = Wv(s) + Θ(u, v, s) α(s)(D(t) - Wv(s))
Yes, you'll need 50 inputs (and thus 50 weights) for each neuron.
You basically do a linear interpolation between each neuron's weight vector and the input vector, using W(s + 1) = W(s) + Θ() * α(s) * (Input(t) - W(s)), with Θ being your neighbourhood function.
And you should update all your neurons, not only the winner.
Which function you use as a neighbourhood function depends on your actual problem.
A common property of such a function is that it has value 1 when i = k and falls off with the Euclidean distance. Additionally, it shrinks with time (in order to localize clusters).
Simple neighbourhood functions include linear interpolation (up to a "maximum distance") or a Gaussian function.
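Putting this together, here is a minimal numpy sketch of a single SOM update step (my own illustration: a 10x10 map with 50-dimensional reference vectors, a Gaussian neighbourhood, and a learning rate and radius that shrink over time):

import numpy as np

rng = np.random.default_rng(0)
map_h, map_w, dim = 10, 10, 50
W = rng.random((map_h, map_w, dim))              # one 50-dimensional weight vector per neuron

def som_step(W, x, t, n_steps, lr0=0.5, radius0=5.0):
    # find the winner (best matching unit)
    dists = np.linalg.norm(W - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # learning rate alpha(s) and neighbourhood radius shrink with time
    frac = t / n_steps
    lr = lr0 * (1.0 - frac)
    radius = radius0 * (1.0 - frac) + 1e-9
    # Gaussian neighbourhood Theta based on grid distance to the winner
    yy, xx = np.mgrid[0:map_h, 0:map_w]
    grid_dist2 = (yy - bmu[0]) ** 2 + (xx - bmu[1]) ** 2
    theta = np.exp(-grid_dist2 / (2.0 * radius ** 2))
    # update ALL neurons: W(s+1) = W(s) + Theta * alpha(s) * (Input(t) - W(s))
    W += theta[:, :, None] * lr * (x - W)
    return W

x = rng.random(dim)                              # one 50-dimensional data row
W = som_step(W, x, t=0, n_steps=1000)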