Why scaling data is very important in neural network(LSTM)

Why scaling data is very important in neural network(LSTM) - neural-network

I am writing my master thesis about how to apply LSTM neural network in time series. In my experiment, i found out that scaling data can have a great impact on the result. For example, when i use a tanh activation function, and the value range is between -1 and 1, the model seems to converge faster and the validation error also does not jump dramatically after each epoch.
Does anyone know is there any mathmetical explanation for that? Or is there any papers already explain about this situation?

Your question reminds me of a picture used in our class, but you can find a similar one from here at 3:02.
In the picture above you can see obviously that the path on the left is much longer than that on the right. The scaling is applied to the left to become the right one.

may the point is nonlinearity. my approach is from chaos theory ( fractals , multifractals,... ) and the range of input and parameter values of a nonlinear dynamical system have strong influence on the system behavior. this is because of the nonlinearity, in case of tanh the type of nonlinearity in the interval [-1,+1] is different than in other intervals, i.e. in the range [10,infinity) it is approx. a constant.
any nonlinear dynamical system is only valid in a specific range for both parameters and initial value, see i.e. the logistic map. Depending on the range of parameter values and initial values the behavior of the logistic map is completely different, this is the sensitivity to initial conditions
RNNs can be regarded as nonlinear self-referential systems.
in general there are some remarkable similarities between nonlinear dynamical systems and neural networks, i.e. the fading memory property of Volterra series models in Nonlinear Systems Identification and the vanishing gradient in recurrent neural networks
strongly chaotic systems have the sensitivity to initial conditions property and it is not possible to reproduce this heavily nonlinear behavior neither by Volterra series nor by RNNs because of the fading memory, resp. the vanishing gradient
so the mathematical background could be that a nonlinearity is more 'active' in the range of a specific intervall while linearity is equally active anywhere ( it is linear or approx constant )
in the context of RNNs and monofractality / multifractality scaling has two different meanings. This is especially confusing because RNNs and nonlinear, self-referential systems are deeply linked
in the context of RNNs scaling means a limiting of the range of
input or output values in the sense of an affine transformation
in context of monofractality / multifractality scaling means that
the output of the nonlinear system has a specific structure that is
scale invariant in case of monofractals, self-affine in case of self-affine fractals ... where the scale is equivalent to a 'zoom level'
The link between RNNs and nonlinear self-referential systems is that they are both exactly that, nonlinear and self-referential.
in general sensitivity to initial conditions ( which is related to the sensitivity to scaling in RNNs ) and scale invariance in the resulting structures ( output ) only appears in nonlinear self-referential systems
the following paper is a good summary for multifractal and monofractal scaling in the output of a nonlinear self-referential system ( not to be confused with the scaling of input and output of RNNs ) : http://www.physics.mcgill.ca/~gang/eprints/eprintLovejoy/neweprint/Aegean.final.pdf
in this paper is a direct link between nonlinear systems and RNN : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4107715/ - Nonlinear System Modeling with Random Matrices: Echo State Networks Revisited

Related

Is it necessary to use a linear bottleneck layer for autoencoder?

I'm currently trying to use an autoencoder network for dimensionality reduction.
(i.e. using the bottleneck activation as the compressed feature)
I noticed that a lot of studies that used autoencoder for this task uses a linear bottleneck layer.
By intuition, I think this makes sense since the usage of non-linear activation function may reduce the bottleneck feature's capability to represent the principle information contained within the original feature.
(e.g., ReLU ignores the negative values and sigmoid suppresses values too high or too low)
However, is this correct? And is using linear bottleneck layer for autoencoder necessary?
If it's possible to use a non-linear bootleneck layer, what activation function would be the best choice?
Thanks.

No, you are not limited to linear activation functions. An example of that is this work, where they use the hidden state of the GRU layers as an embedding for the input. The hidden state is obtained by using non-linear tanh and sigmoid functions in its computation.
Also, there is nothing wrong with 'ignoring' the negative values. The sparsity may, in fact, be beneficial. It can enhance the representation. The noise that can be created by other functions such as identity or sigmoid function may introduce false dependencies where there are none. By using ReLU we can represent the lack of dependency properly (as a zero) as opposed to some near zero value which is likely for e.g. sigmoid function.

Linear vs nonlinear neural network?

I'm new to machine learning and neural networks. I know how to build a nonlinear classification model, but my current problem has a continuous output. I've been searching for information on neural network regression, but all I encounter is information on linear regression - nothing about nonlinear cases. Which is odd, because why would someone use neural networks to solve a simple linear regression anyway? Isn't that like killing a fly with a nuclear bomb?
So my question is this: what makes a neural network nonlinear? (Hidden layers? Nonlinear activation function?) Or do I have a completely wrong understanding of the word "linear" - can a linear regression NN accurately model datasets that are more complex than y=aX+b? Is the word "linear" used just as the opposite of "logistic"?
(I'm planning to use TensorFlow, but the TensorFlow Linear Model Tutorial uses a binary classification problem as an example, so that doesn't help me either.)

For starters, a neural network can model any function (not just linear functions) Have a look at this - http://neuralnetworksanddeeplearning.com/chap4.html.
A Neural Network has got non linear activation layers which is what gives the Neural Network a non linear element.
The function for relating the input and the output is decided by the neural network and the amount of training it gets. If you supply two variables having a linear relationship, then your network will learn this as long as you don't overfit. Similarly, a complex enough neural network can learn any function.

WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.
Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.
First let me point out that we can talk about linearity in parameters or linearity in the variables.
The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).
For example, notice that the following regression predicted values are considered linear predictions, despite non-linear transformations of the inputs because the output constitutes a linear combination of the parameters (although this model is non-linear in its variables):
Now for simplicity, let us consider a single neuron, single layer neural network:
If the transfer function is linear then:
As you have already probably noticed, this is a linear regression. Even if we were to add multiple inputs and neurons, each with a linear activation function, we would now only have an ensemble of regressions (all linear in their parameters and therefore this simple neural network is linear):
Now going back to (3), let's add two layers, so that we have a neural network with 3 layers, one neuron each (both with linear activation functions):
(first layer)
(second layer)
Now notice:
Reduces to:
Where and
Which means that our two layered network (each with a single neuron) is not linear in its parameters despite every activation function in the network being linear; however, it is still linear in the variables. Thus, once training has finished the model will be linear in both variables and parameters. Both of these are important because you cannot replicate this simple two layered network with a single regression and still capture all the effects of the model. Further, let me state clearly: if you use a model with multiple layers there is no guarantee that the output will be non-linear in it's variables (if you use a simple MLP perceptron and line activation functions your picture is still going to be a line).
That being said, let's take a look at the following statement from #Pawu regarding this answer:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
While you could argue that what #Pawu is saying is technically true, I think they are implying:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.
I would argue that this modified statement is wrong and can easily be demonstrated incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to using certain network architectures that you cannot introduce non-linearities without activation functions, but that is a arbitrary restriction and does not generalize to all network models.
Let me make this concrete. First take a simple xor problem. This is a basic classification problem where you are attempting to establish a boundary between data points in a configuration like so:
The kicker about this problem is that it is not linearly separable, meaning no single straight line will be able to perfectly classify. Now if you read anywhere on the internet I am sure they will say that this problem cannot be solved using only linear activation functions using a neural network (notice nothing is said about the architecture). This statement is only true in an extremely limited context and wrong generally.
Allow me to demonstrate. Below is a very simple hand written neural network. This network takes randomly generated weights between -1 and 1, an "xor_network" function which defines the architecture (notice no sigmoid, hardlims, etc. only linear transformations of the form mX or MX + B), and trains using standard backward propagation:
#%% Packages
import numpy as np
#%% Data
data = np.array([[0, 0, 0],[0, 1, 1],[1, 0, 1],[1, 1, 0]])
np.random.shuffle(data)
train_data = data[:,:2]
target_data = data[:,2]
#%% XOR architecture
class XOR_class():
def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
self.train_data = train_data
self.target_data = target_data
self.alpha = alpha
self.epochs = epochs
#Random weights
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
#xor network (linear transfer functions only)
def xor_network(self, X0):
n0 = np.dot(X0, self.W0) + self.b0
X1 = n0*X0
a = np.dot(X1, self.W2) + self.b2
return(a, X1)
#Training the xor network
def train(self):
for epoch in range(self.epochs):
for i in range(len(self.train_data)):
# Forward Propagation:
X0 = self.train_data[i]
a, X1 = self.xor_network(X0)
# Backward Propagation:
e = self.target_data[i] - a
s_2 = -2*e
# Update Weights:
self.W0 = self.W0 - (self.alpha*s_2*X0)
self.b0 = self.b0 - (self.alpha*s_2)
self.W2 = self.W2 - (self.alpha*s_2*X1)
self.b2 = self.b2 - (self.alpha*s_2)
#Restart training if we get lost in the parameter space.
if np.isnan(a) or (a > 1) or (a < -1):
print('Bad initialization, reinitializing.')
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
self.train()
#Predicting using the trained weights.
def predict(self, test_data):
for i in train_data:
a, X1 = self.xor_network(i)
#I cut off decimals past 12 for convienience, not necessary.
print(f'input: {i} - output: {np.round(a, 12)}')
Now let's take a look at the output:
#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:,:2]
xor.predict(test_data)
input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]
And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right classification with pure line activation functions, no sigmoid needed). . .
The only catch here is that I cut off all decimals past 12, but let's be honest 7.3 X 10^-16 is basically 0.
Now to be fair I am doing a little trick, where I am using the network connections to get the non-linear result, but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.
Thus the answer to your question, "what makes a neural network non-linear" is: non-linearity in the parameters or, obviously, non-linearity in the variables.
This non-linearity in the parameters/variables comes about two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.
For an example on non-linearity coming about through activation functions, suppose our input space, weights, and biases are all constrained such that they are all strictly positive (for simplicity). Now using (2) (single layer, single neuron) and the activation function , we have the following:
Which Reduces to:
Where , , and
Now, ignoring what issues this neural network has, it should be clear, that at the very least, it is non-linear in the parameters and variables and that non-linearity has been introduced solely by choice of the activation function.
Finally, yes neural networks can model complex data structures that cannot be modeled by using linear models (see xor example above).
EDIT:
As pointed out by #hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks nonlinear (because they are non-linear in the variables), but that the non-linearity introduced by them is degenerate without parameter non-linearity. For example, a single layered MLP with sigmoid activation functions will produce outputs that are non-linear in the variables in that the output is not proportional to the input, but in reality this is just an array of Generalized Linear Models. This should be especially obvious if we were to transform the targets by the appropriate link function, where now the activation functions would be linear. Now this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but that their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters we have a whole different beast than regression.
At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.

When it comes to nonlinear regression, this is referring to how the weights affect the output. If a function is not linear with respect to the weights, then your problem is a nonlinear regression problem. So for example, let's look at a Feedforward Neural Network with one hidden layer where the activation functions in the hidden layer are some function and the output layer has linear activation functions. Given this, the mathematical representation can be:
where we assume can operator on scalars and vectors with this notation to make it easy. , , , and are the weight you are aiming to estimate with the regression. If this was linear regression, would equal z, because that would make y linearly dependent on & . But if is nonlinear, say like , then now y is nonlinearly dependent on the weights .
Now provided you understand all that, I am surprised you haven't seen discussion of the nonlinear case because that's pretty much all people talk about in textbooks and research. The use of things like stochastic gradient descent, Nonlinear Conjugate Gradient, RProp, and other methods are to help find local minima (and hopefully good local minima) for these nonlinear regression problems, even though a global optimum is not typically guaranteed.

Any non-linearity from the input to output makes the network non-linear. In the way we usually think about and implement neural networks, those non-linearities come from activation functions.
If we are trying to fit non-linear data and only have linear activation functions, our best approximation to the non-linear data will be linear since that's all we can compute. You can see an example of a neural network trying to fit non-linear data with only linear activation functions here.
However, if we change the linear activation function to something non-linear like ReLu, then we can see a better non-linear fitting of the data. You can see that here.

I do not have enough reputation to comment on itwasthekix post, but I want to share my insight.
Someone asked in the comments whether equation 8 was linear, and the
answer was, that if w1 were to be varied when all else is constant
we would move up and down a non-linear function. This is not true.
When we vary w1, we essentially only change the output of z1 = (w1*p + b1). Since z1 is linearly transformed later, we will still
move an a linear function. If we were to fix everything except w1
AND w2, then we would move on a non-linear function.
If a multi-layer ANN is non-linear in parameters, because we have a
multiplication of parameters. That does not mean it can learn non-linear relationships.
The answer is very misleading and makes it sound, that we can
learn non-linear relationships using only linear transformations,
which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
If we take the gradient of w1*w2 and perform gradient descent, we only know the joint gradient, there is no way to determine the influence of the separate parameters without fixing one of them. And if we fix one, we move on a linear function.
If we add an (non-linear) activation function, we linearly transform a non-linear output enabling us to learn non-linear relationships, since we do not move on a linear function anymore.
Lets look at the case z = w2 * g(w1 * p + b1) + b2 assuming g is a non-linear activation function. Then if we fix everything and vary w1, we will move on a non-linear function, since w1 * p + b1 is transformed by g.

Non-linearity means different things in communities of regression analysis and neural network machine learning.
In regression analysis, when we say a fitting model is nonlinear, we mean that the model is nonlinear in terms of its parameters (not in terms of the independent variables).
A multiple-layer neural network is usually nonlinear in terms of the weights even the activation function is linear. This is simple to see because the information propagating in the network corresponds to function composition: f3(f2(f1())), which generally gives nonlinear functions of weights. Therefore, in terms of regression analysis, all neural networks are nonlinear models.
However in the community of neural network, people talk about the linearity in terms of input variables, rather than the weights/biases. Therefore, they define a neutral network with linear activation functions as linear and that with nonlinear activation function as nonlinear.

I had the same struggle, most online courses use ANNs for classification, but you never actually solve a regression problem with them in the courses.
What does make an ANN non-linear? The activation function.
Even if you have an ANN with thousands of perceptrons and hidden units, if all the activations are linear (or not activated at all) you are just training a plain linear regression.
But be careful, some activations functions (like sigmoid), have a range of values that act as a linear function and you may get stuck with a linear model even with non-linear activations.
How to predict continuous output with an ANN? The same way as when you classify.
It is the same problem, you just backpropagate the error (label - prediction) and update the weights. But don't forget to CHANGE THE ACTIVATION FUNCTION of the output layer to a continuous function (maybe ReLu if all labels are positive or don't activate the output at all), the intermediate hidden layers can be activated however you wish.
For small regression problems with ANNs you may need to start with a veeeeeery small learning rate since there will be lots of variance since the error will be "unbounded" at first.
Hope this helps :)

I don't want to be impolite, but the current answers are all related to nonlinear ND-polynomials resulting from linear activation functions. That simply doesn't make sense in terms of this question.
I get the point because you will have a polynomial as the objective function to minimize with coefficients that are products of layer coefficients and a product is nonlinear. Anyway, such a system will never be able to converge and doesn't make sense at all without any extra constraints.
The described system is not only completely unnecessarily nonlinear, but also ill-posed. Don't argue about stuff that leads ad absurdum. The original question actually completely nailed it.
Build a "linear neural network" with layers and try to use it as usual... then you will realise that this goes nowhere and you wasted your time.
So unless there is good reasons to believe this kind of ill-posed stuff has been handled I would never ever consider using a linear activation function. If you have extra constraints this might make sense. If you use stochastic gradient descent then you will at least skip some bad properties of it.
That the objective function is nonlinear in its parameters gives an impression that is wrong and bogus. And if the writer would have known about optimisation problems connected to terms with a product of coefficients he would have never written anything like this.
Any objective function can be made nonlinear. If you just replace one linear coefficient with a product of two coefficients. But that is nonsense because you can never determine those coefficients. NEVER. There are infinitely many solutions! And that doesn't even depend on the amount of data.

Because the activation is w*x, which is linear operation, so you need to have extra elements to make it non-linear.

Label Normalization in Deep Regression Networks

In regression problems there is typically no reason for normalizing/rescaling the labels (targets) before performing the optimization.
In deep regression networks there would be in principle no need to rescale since the last activation function is linear and the cost function is the mean squared difference of the predictions from the targets.
On the other hand, for numerical stability and performance of the training process, the values of the input and hidden units are kept in the range [-1,1] via feature normalization. Doesn't it mean that the labels should be rescaled to the range [-1,1] too?

traditionally in regression problems, you denormalize the output generated

SVM in Matlab: Meaning of Parameter 'box constraint' in function fitcsvm

I'm new to SVMs in Matlab and need a little bit of help with it.
I want to train a support vector machine using the build in function fitcsvm of the Statistics Toolbox.
Of course there are many parameter choices which control how the SVM will be trained.
The Matlab help is a litte bit wage about how the parameters archive a better training result. Especially the parameter 'Box Contraint' seems to have an important influence on the number of chosen support vectors and generalization quality.
The Help (http://de.mathworks.com/help/stats/fitcsvm.html#bt8v_z4-1) says
A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).
If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.
How exactly is this parameter used?
Is it the same or something like the soft margin factor C in the Wikipedia reference?
Or something completely different?
Thanks for your help.

You were definitely on the right path. While description in the documentation of fitcsvm (as you posted in the question) is very short, you should have a look at the Understanding Support Vector Machines site in the MATLAB documentation.
In the non-separable case (often called Soft-Margin SVM), one allows misclassifications, at the cost of a penalty factor C. The mathematical formulation of the SVM then becomes:
with the slack variables s_i which cause a penalty term which is weighted by C.
Making C large increases the weight of misclassifications, which leads to a stricter separation.
This factor C is called box constraint.
The reason for this name is, that in the formulation of the dual optimization problem, the Langrange multipliers are bounded to be within the range [0,C].
C thus poses a box constraint on the Lagrange multipliers.
tl;dr your guess was right, it is the C in the soft margin SVM.

Does it make sense to use an "activation function cocktail" for approximating an unknown function through a feed-forward neural network?

I just started playing around with neural networks and, as I would expect, in order to train a neural network effectively there must be some relation between the function to approximate and activation function.
For instance, I had good results using sin(x) as an activation function when approximating cos(x), or two tanh(x) to approximate a gaussian. Now, to approximate a function about which I know nothing I am planning to use a cocktail of activation functions, for instance a hidden layer with some sin, some tanh and a logistic function. In your opinion does this make sens?
Thank you,
Tunnuz

While it is true that different activation functions have different merits (mainly for either biological plausibility or a unique network design like radial basis function networks), in general you be able to use any continuous squashing function and expect to be able to approximate most functions encountered in real world training data.
The two most popular choices are the hyperbolic tangent and the logistic function, since they both have easily calculable derivatives and interesting behavior around the axis.
If neither if those allows you to accurately approximate your function, my first response wouldn't be to change activation functions. Rather, you should first investigate your training set and network training parameters (learning rates, number of units in each pool, weight decay, momentum, etc.).
If your still stuck, step back and make sure your using the right architecture (feed forward vs. simple recurrent vs. full recurrent) and learning algorithm (back-propagation vs. back-prop through time vs. contrastive hebbian vs. evolutionary/global methods).
One side note: Make sure you never use a linear activation function (except for output layers or crazy simple tasks), as these have very well documented limitations, namely the need for linear separability.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Why scaling data is very important in neural network(LSTM) - neural-network

Your question reminds me of a picture used in our class, but you can find a similar one from here at 3:02. In the picture above you can see obviously that the path on the left is much longer than that on the right. The scaling is applied to the left to become the right one.

Related

Is it necessary to use a linear bottleneck layer for autoencoder?

Linear vs nonlinear neural network?

Label Normalization in Deep Regression Networks

SVM in Matlab: Meaning of Parameter 'box constraint' in function fitcsvm

Does it make sense to use an "activation function cocktail" for approximating an unknown function through a feed-forward neural network?

Categories

Resources