Non-linear classification vs regression with FFANN - matlab

I am trying to differentiate between two classes of data for forecasting. Basically the dependent variables are features of a signal that I want to forecast. I want to predict whether the signal will have a positive or negative slope in the near future (1 time step ahead). I have tried different kinds of time series analysis, such as Fourier analysis, fitting using neural networks, auto-regressive models, and classification with neural nets (using patternnet in Matlab).
The function is continuous, so the most logical assumption is to use some regression analysis tool to determine what's going to happen. However, since I only care whether the slope is going to be positive or negative, I changed the signal to a binary signal (1 if the slope is positive, -1 if the slope is 0 or negative).
These are by far the best results I have gotten! However, for some unknown reason a neural net designed for classification did not work (the confusion matrix stated that there was a precision of around 50%). So I decided to try with a regular feedforward neural net...
Since the neural network outputs continuous data, I didn't know what to do... But then I remembered logistic regression, and since its transfer function is a logistic function (bounded by 0 and 1), its output can be interpreted as a probability. So I basically did the same, defined a threshold (e.g. above 0 is 1, below 0 is -1), and voila! The precision skyrocketed! I am getting a precision of around 70-80%.
Since I am using a sigmoid transfer function, the neural network will have a continuous output just as logistic regression does (but in this case between -1 and 1), so I am assuming my approach is technically still regression and not classification. My question is... Which is better? For my specific problem, where fitting did not give really good results and I had to convert this to a binary problem... Which should give better results? Classification or regression?
Should I try a different configuration of a neural net (with a different transfer function), should I try with support vector machine or any other classification algorithm? Or should I stick with regression but defining a threshold myself just as I would do with logistic regression?
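The thresholding step described in the question can be sketched as follows. This is only a minimal illustration, assuming the network's continuous outputs are already available as a NumPy array; the helper names and the arrays `outputs` and `targets` are made up for the example and are not the asker's data.

```python
import numpy as np

def threshold_to_labels(outputs, threshold=0.0):
    """Map continuous outputs in [-1, 1] to class labels +1 / -1."""
    return np.where(outputs > threshold, 1, -1)

def accuracy(predicted, targets):
    """Fraction of labels that match the targets."""
    return float(np.mean(predicted == targets))

# Hypothetical network outputs and true slope signs (illustrative only).
outputs = np.array([0.8, -0.3, 0.1, -0.9, 0.4])
targets = np.array([1, -1, -1, -1, 1])

labels = threshold_to_labels(outputs)
print(labels, accuracy(labels, targets))
```

Seen this way, the "regression" network followed by a threshold is already acting as a classifier; the threshold is simply chosen by hand instead of being learned.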

Related

Linear vs nonlinear neural network?

I'm new to machine learning and neural networks. I know how to build a nonlinear classification model, but my current problem has a continuous output. I've been searching for information on neural network regression, but all I encounter is information on linear regression - nothing about nonlinear cases. Which is odd, because why would someone use neural networks to solve a simple linear regression anyway? Isn't that like killing a fly with a nuclear bomb?
So my question is this: what makes a neural network nonlinear? (Hidden layers? Nonlinear activation function?) Or do I have a completely wrong understanding of the word "linear" - can a linear regression NN accurately model datasets that are more complex than y=aX+b? Is the word "linear" used just as the opposite of "logistic"?
(I'm planning to use TensorFlow, but the TensorFlow Linear Model Tutorial uses a binary classification problem as an example, so that doesn't help me either.)
For starters, a neural network can model any function (not just linear functions). Have a look at this - http://neuralnetworksanddeeplearning.com/chap4.html.
A neural network has non-linear activation layers, which is what gives it a non-linear element.
The function for relating the input and the output is decided by the neural network and the amount of training it gets. If you supply two variables having a linear relationship, then your network will learn this as long as you don't overfit. Similarly, a complex enough neural network can learn any function.
WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.
Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.
First let me point out that we can talk about linearity in parameters or linearity in the variables.
The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).
For example, notice that the predicted values of a regression such as the following are considered linear predictions, despite non-linear transformations of the inputs, because the output constitutes a linear combination of the parameters (although this model is non-linear in its variables):
\( \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 \)    (1)
Now for simplicity, let us consider a single neuron, single layer neural network with input p, weight w, bias b, and transfer function f:
\( a = f(w p + b) \)    (2)
If the transfer function is linear then:
\( a = w p + b \)    (3)
As you have already probably noticed, this is a linear regression. Even if we were to add multiple inputs and neurons, each with a linear activation function, we would now only have an ensemble of regressions (all linear in their parameters and therefore this simple neural network is linear):
\( a_i = \sum_j w_{ij} p_j + b_i \)    (4)
Now going back to (3), let's add a second layer, so that we have a neural network with two layers, one neuron each (both with linear activation functions):
\( a_1 = w_1 p + b_1 \)    (5) (first layer)
\( a_2 = w_2 a_1 + b_2 \)    (6) (second layer)
Now notice:
\( a_2 = w_2 (w_1 p + b_1) + b_2 \)    (7)
Reduces to:
\( a_2 = w_1 w_2 p + w_2 b_1 + b_2 = W p + B \)    (8)
Where \( W = w_1 w_2 \) and \( B = w_2 b_1 + b_2 \).
Which means that our two-layer network (each layer with a single neuron) is not linear in its parameters, despite every activation function in the network being linear, because the parameters multiply each other; however, it is still linear in the variables. Thus, once training has finished and the weights are fixed, the model will be linear in both variables and parameters. Both of these are important because you cannot replicate this simple two-layer network with a single regression and still capture all the effects of the model. Further, let me state clearly: if you use a model with multiple layers, there is no guarantee that the output will be non-linear in its variables (if you use a simple MLP with linear activation functions, your picture is still going to be a line).
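A quick numerical check of this point, as a minimal sketch: two stacked single-neuron layers with linear activations give the same outputs as one line in the input. The names `w1`, `b1`, `w2`, `b2`, `p`, `W`, and `B` are chosen here for illustration, and the weights are random values rather than trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1, w2, b2 = rng.uniform(-1, 1, size=4)

def two_layer(p):
    """Two single-neuron layers, both with linear (identity) activations."""
    a1 = w1 * p + b1       # first layer
    return w2 * a1 + b2    # second layer

# The same map written as a single line W*p + B.
W = w1 * w2                # a product of parameters: non-linear in the parameters
B = w2 * b1 + b2

p = np.linspace(-5, 5, 11)
max_gap = np.max(np.abs(two_layer(p) - (W * p + B)))
print(max_gap)
```

The gap is zero up to floating-point rounding, so the fitted model is a line in p even though W is a product of two weights.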
That being said, let's take a look at the following statement from @Pawu regarding this answer:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
While you could argue that what @Pawu is saying is technically true, I think they are implying:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.
I would argue that this modified statement is wrong and can easily be demonstrated to be incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to certain network architectures you cannot introduce non-linearities without activation functions, but that is an arbitrary restriction and does not generalize to all network models.
Let me make this concrete. First take a simple xor problem. This is a basic classification problem where you are attempting to establish a boundary between data points in a configuration where (0, 0) and (1, 1) belong to one class, while (0, 1) and (1, 0) belong to the other.
The kicker about this problem is that it is not linearly separable, meaning no single straight line will be able to perfectly classify. Now if you read anywhere on the internet I am sure they will say that this problem cannot be solved using only linear activation functions using a neural network (notice nothing is said about the architecture). This statement is only true in an extremely limited context and wrong generally.
Allow me to demonstrate. Below is a very simple hand written neural network. This network takes randomly generated weights between -1 and 1, an "xor_network" function which defines the architecture (notice no sigmoid, hardlims, etc. only linear transformations of the form mX or MX + B), and trains using standard backward propagation:
#%% Packages
import numpy as np

#%% Data
data = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]])
np.random.shuffle(data)
train_data = data[:, :2]
target_data = data[:, 2]

#%% XOR architecture
class XOR_class():

    def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
        self.train_data = train_data
        self.target_data = target_data
        self.alpha = alpha
        self.epochs = epochs

        # Random weights
        self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b0 = np.random.uniform(low=-1, high=1, size=(1))
        self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b2 = np.random.uniform(low=-1, high=1, size=(1))

    # xor network (linear transfer functions only)
    def xor_network(self, X0):
        n0 = np.dot(X0, self.W0) + self.b0
        # The first layer's output rescales the input itself.
        X1 = n0*X0
        a = np.dot(X1, self.W2) + self.b2
        return(a, X1)

    # Training the xor network
    def train(self):
        for epoch in range(self.epochs):
            for i in range(len(self.train_data)):
                # Forward Propagation:
                X0 = self.train_data[i]
                a, X1 = self.xor_network(X0)

                # Backward Propagation:
                e = self.target_data[i] - a
                s_2 = -2*e

                # Update Weights:
                self.W0 = self.W0 - (self.alpha*s_2*X0)
                self.b0 = self.b0 - (self.alpha*s_2)
                self.W2 = self.W2 - (self.alpha*s_2*X1)
                self.b2 = self.b2 - (self.alpha*s_2)

            # Restart training if we get lost in the parameter space.
            if np.isnan(a) or (a > 1) or (a < -1):
                print('Bad initialization, reinitializing.')
                self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
                self.b0 = np.random.uniform(low=-1, high=1, size=(1))
                self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
                self.b2 = np.random.uniform(low=-1, high=1, size=(1))
                self.train()

    # Predicting using the trained weights.
    def predict(self, test_data):
        # Iterate over the data passed in, not the global train_data.
        for i in test_data:
            a, X1 = self.xor_network(i)
            # I cut off decimals past 12 for convenience, not necessary.
            print(f'input: {i} - output: {np.round(a, 12)}')
Now let's take a look at the output:
#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:,:2]
xor.predict(test_data)
input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]
And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right, classification with purely linear activation functions, no sigmoid needed). . .
The only catch here is that I cut off all decimals past 12, but let's be honest 7.3 X 10^-16 is basically 0.
Now to be fair I am doing a little trick, where I am using the network connections to get the non-linear result, but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.
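To make the "little trick" explicit, here is a minimal sketch of the same xor_network forward pass with weights picked by hand. The specific values W0 = [1, -1], W2 = [1, -1] and zero biases are an assumption chosen for illustration, not the weights that training finds. Because the first layer's output multiplies the input, the output is quadratic in the input; with these weights it equals (x1 - x2)^2, which is exactly XOR on binary inputs.

```python
import numpy as np

def xor_network(x, W0, b0, W2, b2):
    """Same forward pass as the xor_network method, written as a plain function."""
    n0 = np.dot(x, W0) + b0       # first linear map (a scalar)
    x1 = n0 * x                   # the scalar rescales the input: a multiplicative connection
    return np.dot(x1, W2) + b2    # second linear map

# Hand-picked weights (illustrative assumption, not learned values).
W0, b0 = np.array([1.0, -1.0]), 0.0
W2, b2 = np.array([1.0, -1.0]), 0.0

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = np.array([xor_network(x, W0, b0, W2, b2) for x in inputs])
print(outputs)
```

The non-linearity therefore comes from the product n0 * x in the architecture, not from any activation function.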
Thus the answer to your question, "what makes a neural network non-linear" is: non-linearity in the parameters or, obviously, non-linearity in the variables.
This non-linearity in the parameters/variables comes about in two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.
For an example on non-linearity coming about through activation functions, suppose our input space, weights, and biases are all constrained such that they are all strictly positive (for simplicity). Now using (2) (single layer, single neuron) and the activation function , we have the following:
Which Reduces to:
Where , , and
Now, ignoring what issues this neural network has, it should be clear, that at the very least, it is non-linear in the parameters and variables and that non-linearity has been introduced solely by choice of the activation function.
Finally, yes neural networks can model complex data structures that cannot be modeled by using linear models (see xor example above).
EDIT:
As pointed out by @hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks nonlinear (because they are non-linear in the variables), but that the non-linearity introduced by them is degenerate without parameter non-linearity. For example, a single layered MLP with sigmoid activation functions will produce outputs that are non-linear in the variables in that the output is not proportional to the input, but in reality this is just an array of Generalized Linear Models. This should be especially obvious if we were to transform the targets by the appropriate link function, where now the activation functions would be linear. Now this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but that their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters we have a whole different beast than regression.
At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.
When it comes to nonlinear regression, this is referring to how the weights affect the output. If a function is not linear with respect to the weights, then your problem is a nonlinear regression problem. So for example, let's look at a Feedforward Neural Network with one hidden layer where the activation functions in the hidden layer are some function \( \sigma \) and the output layer has linear activation functions. Given this, the mathematical representation can be:
\( y = W_2 \sigma(W_1 x + b_1) + b_2 \)
where we assume \( \sigma \) can operate on scalars and vectors with this notation to make it easy. \( W_1 \), \( W_2 \), \( b_1 \), and \( b_2 \) are the weights you are aiming to estimate with the regression. If this was linear regression, \( \sigma(z) \) would equal z, because that would make y linearly dependent on \( W_1 \) & \( W_2 \). But if \( \sigma \) is nonlinear, then y is nonlinearly dependent on the weights \( W_1 \).
Now provided you understand all that, I am surprised you haven't seen discussion of the nonlinear case because that's pretty much all people talk about in textbooks and research. The use of things like stochastic gradient descent, Nonlinear Conjugate Gradient, RProp, and other methods are to help find local minima (and hopefully good local minima) for these nonlinear regression problems, even though a global optimum is not typically guaranteed.
Any non-linearity from the input to output makes the network non-linear. In the way we usually think about and implement neural networks, those non-linearities come from activation functions.
If we are trying to fit non-linear data and only have linear activation functions, our best approximation to the non-linear data will be linear since that's all we can compute. You can see an example of a neural network trying to fit non-linear data with only linear activation functions here.
However, if we change the linear activation function to something non-linear like ReLU, then we can see a better non-linear fitting of the data. You can see that here.
I do not have enough reputation to comment on itwasthekix's post, but I want to share my insight.
Someone asked in the comments whether equation 8 was linear, and the
answer was, that if w1 were to be varied when all else is constant
we would move up and down a non-linear function. This is not true.
When we vary w1, we essentially only change the output of z1 = (w1*p + b1). Since z1 is linearly transformed later, we will still
move on a linear function. If we were to fix everything except w1
AND w2, then we would move on a non-linear function.
A multi-layer ANN is non-linear in its parameters because we have a
multiplication of parameters. That does not mean it can learn non-linear relationships.
The answer is very misleading and makes it sound, that we can
learn non-linear relationships using only linear transformations,
which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
If we take the gradient of w1*w2 and perform gradient descent, we only know the joint gradient, there is no way to determine the influence of the separate parameters without fixing one of them. And if we fix one, we move on a linear function.
If we add an (non-linear) activation function, we linearly transform a non-linear output enabling us to learn non-linear relationships, since we do not move on a linear function anymore.
Let's look at the case z = w2 * g(w1 * p + b1) + b2, assuming g is a non-linear activation function. Then if we fix everything and vary w1, we will move on a non-linear function, since w1 * p + b1 is transformed by g.
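This claim about varying a single weight can be checked numerically. The sketch below is an illustration under assumptions: the fixed values of p, b1, w2, b2 are arbitrary, and tanh stands in for a generic non-linear g. On a uniform grid of w1 values, the second differences of an affine function are zero, so a non-zero second difference shows the output is non-linear in w1.

```python
import numpy as np

def output(w1, g, p=0.7, b1=0.2, w2=1.5, b2=-0.3):
    """z = w2 * g(w1 * p + b1) + b2 with everything but w1 held fixed."""
    return w2 * g(w1 * p + b1) + b2

w1_grid = np.linspace(-2, 2, 9)   # uniform grid over the single varied weight

linear_out = output(w1_grid, lambda z: z)   # identity activation
tanh_out = output(w1_grid, np.tanh)         # assumed non-linear activation

# Second differences vanish (up to rounding) only for an affine function of w1.
second_diff_linear = np.max(np.abs(np.diff(linear_out, n=2)))
second_diff_tanh = np.max(np.abs(np.diff(tanh_out, n=2)))
print(second_diff_linear, second_diff_tanh)
```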
Non-linearity means different things in communities of regression analysis and neural network machine learning.
In regression analysis, when we say a fitting model is nonlinear, we mean that the model is nonlinear in terms of its parameters (not in terms of the independent variables).
A multiple-layer neural network is usually nonlinear in terms of the weights even if the activation function is linear. This is simple to see because the information propagating in the network corresponds to function composition: f3(f2(f1())), which generally gives nonlinear functions of the weights. Therefore, in terms of regression analysis, all multiple-layer neural networks are nonlinear models.
However, in the neural network community, people talk about linearity in terms of the input variables, rather than the weights/biases. Therefore, they define a neural network with linear activation functions as linear and one with nonlinear activation functions as nonlinear.
I had the same struggle, most online courses use ANNs for classification, but you never actually solve a regression problem with them in the courses.
What does make an ANN non-linear? The activation function.
Even if you have an ANN with thousands of perceptrons and hidden units, if all the activations are linear (or not activated at all) you are just training a plain linear regression.
But be careful, some activation functions (like sigmoid) have a range of values over which they act almost like a linear function, and you may get stuck with a linear model even with non-linear activations.
How to predict continuous output with an ANN? The same way as when you classify.
It is the same problem, you just backpropagate the error (label - prediction) and update the weights. But don't forget to CHANGE THE ACTIVATION FUNCTION of the output layer to a continuous function (maybe ReLU if all labels are positive, or don't activate the output at all); the intermediate hidden layers can be activated however you wish.
For small regression problems with ANNs you may need to start with a veeeeeery small learning rate since there will be lots of variance since the error will be "unbounded" at first.
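As a rough sketch of the recipe above (non-linear hidden layer, un-activated output, small learning rate), here is a tiny NumPy network fitted to a toy target. Everything specific here is an assumption made for illustration: the target y = x^2, the 8 tanh hidden units, the learning rate, and the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = x ** 2                                   # toy continuous target

W1 = rng.normal(0, 0.5, size=(1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, size=(8, 1)); b2 = np.zeros(1)

def forward(inputs):
    hidden = np.tanh(inputs @ W1 + b1)       # non-linear hidden layer
    return hidden, hidden @ W2 + b2          # output layer is left un-activated

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

_, pred = forward(x)
initial_loss = mse(pred, y)

lr = 0.05                                    # small learning rate
for _ in range(2000):
    hidden, pred = forward(x)
    d_out = 2 * (pred - y) / len(x)          # gradient of the MSE w.r.t. the output
    d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)   # back through tanh
    W2 -= lr * (hidden.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (x.T @ d_hidden);   b1 -= lr * d_hidden.sum(axis=0)

_, pred = forward(x)
final_loss = mse(pred, y)
print(initial_loss, final_loss)
```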
Hope this helps :)
I don't want to be impolite, but the current answers are all related to nonlinear ND-polynomials resulting from linear activation functions. That simply doesn't make sense in terms of this question.
I get the point because you will have a polynomial as the objective function to minimize with coefficients that are products of layer coefficients and a product is nonlinear. Anyway, such a system will never be able to converge and doesn't make sense at all without any extra constraints.
The described system is not only completely unnecessarily nonlinear, but also ill-posed. Don't argue about stuff that leads ad absurdum. The original question actually completely nailed it.
Build a "linear neural network" with layers and try to use it as usual... then you will realise that this goes nowhere and you wasted your time.
So unless there are good reasons to believe this kind of ill-posed stuff has been handled, I would never ever consider using a linear activation function. If you have extra constraints this might make sense. If you use stochastic gradient descent then you will at least skip some bad properties of it.
That the objective function is nonlinear in its parameters gives an impression that is wrong and bogus. And if the writer would have known about optimisation problems connected to terms with a product of coefficients he would have never written anything like this.
Any objective function can be made nonlinear if you just replace one linear coefficient with a product of two coefficients. But that is nonsense because you can never determine those coefficients. NEVER. There are infinitely many solutions! And that doesn't even depend on the amount of data.
Because the activation is w*x, which is linear operation, so you need to have extra elements to make it non-linear.

Label Normalization in Deep Regression Networks

In regression problems there is typically no reason for normalizing/rescaling the labels (targets) before performing the optimization.
In deep regression networks there would be in principle no need to rescale since the last activation function is linear and the cost function is the mean squared difference of the predictions from the targets.
On the other hand, for numerical stability and performance of the training process, the values of the input and hidden units are kept in the range [-1,1] via feature normalization. Doesn't it mean that the labels should be rescaled to the range [-1,1] too?
traditionally in regression problems, you denormalize the output generated
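The normalize-then-denormalize workflow can be sketched as below. This is a minimal illustration under assumptions: it standardizes the targets to zero mean and unit variance rather than min-max scaling them to [-1, 1], the helper names are hypothetical, and `y_train` is made-up data. The scaling statistics come from the training targets only and are reused to map predictions back to the original units.

```python
import numpy as np

def fit_scaler(y):
    """Compute the statistics used for scaling, from training targets only."""
    return float(np.mean(y)), float(np.std(y))

def normalize(y, mean, std):
    return (y - mean) / std

def denormalize(y_scaled, mean, std):
    """Map network outputs back to the original target units."""
    return y_scaled * std + mean

y_train = np.array([120.0, 95.0, 143.0, 110.0])   # hypothetical targets
mean, std = fit_scaler(y_train)

y_scaled = normalize(y_train, mean, std)     # what the network would be trained on
y_back = denormalize(y_scaled, mean, std)    # what you would report
print(y_scaled, y_back)
```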

Step function versus Sigmoid function

I don't quite understand why a sigmoid function is seen as more useful (for neural networks) than a step function... hoping someone can explain this for me. Thanks in advance.
The (Heaviside) step function is typically only useful within single-layer perceptrons, an early type of neural networks that can be used for classification in cases where the input data is linearly separable.
However, multi-layer neural networks or multi-layer perceptrons are of more interest because they are general function approximators and they are able to distinguish data that is not linearly separable.
Multi-layer perceptrons are trained using backpropagation. A requirement for backpropagation is a differentiable activation function. That's because backpropagation uses gradient descent on this function to update the network weights.
The Heaviside step function is non-differentiable at x = 0 and its derivative is 0 elsewhere. This means gradient descent won't be able to make progress in updating the weights and backpropagation will fail.
The sigmoid or logistic function does not have this shortcoming and this explains its usefulness as an activation function within the field of neural networks.
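The difference is easy to see numerically. In this minimal sketch (the helper names and sample points are chosen for illustration), a central-difference estimate of the derivative is exactly zero for the step function away from x = 0, while the sigmoid's derivative is positive everywhere and matches sigmoid(x) * (1 - sigmoid(x)).

```python
import numpy as np

def step(x):
    """Heaviside step function."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def central_diff(f, x, eps=1e-4):
    """Numerical derivative estimate."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])   # points away from the discontinuity at 0

step_grad = central_diff(step, x)       # no gradient signal to learn from
sigmoid_grad = central_diff(sigmoid, x) # usable gradient at every point
print(step_grad, sigmoid_grad)
```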
It depends on the problem you are dealing with. In case of simple binary classification, a step function is appropriate. Sigmoids can be useful when building more biologically realistic networks by introducing noise or uncertainty. Another but completely different use of sigmoids is for numerical continuation, i.e. when doing bifurcation analysis with respect to some parameter in the model. Numerical continuation is easier with smooth systems (and very tricky with non-smooth ones).

Binary Classification Cost Function, Neural Networks

I've been tweaking the Deep Learning tutorial to train the weights of a Logistic Regression model for a binary classification problem and the tutorial uses the negative log-likelihood cost function below...
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

def negative_log_likelihood(self, y):
    return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
However, my weights don't seem to be converging properly as my validation error increases over successive epochs.
I was wondering if I'm using the proper cost function to converge upon the proper weights. It might be useful to note that my two classes are very imbalanced and my predictors are already normalized
Few reasons I can think of are:
Your learning rate is too high
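For binary classification, you could try a squared error instead. Note that the negative log-likelihood of a softmax output is already the cross-entropy error, so switching to cross-entropy alone will not change anything.
You are using just one layer. Maybe the dataset you are using requires more layers. So connect more hidden layers.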
Play around with the number of layers and hidden units.
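When weighing the cost-function suggestion, it may help to see that the tutorial's cost, which picks out the log-probability of the correct class, is numerically the same quantity as the cross-entropy with one-hot targets. The sketch below uses plain NumPy rather than Theano, and the logits and labels are made-up values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def negative_log_likelihood(probs, y):
    """Mean of -log p(correct class), indexed the same way as the tutorial's cost."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def cross_entropy(probs, y):
    """Mean cross-entropy against one-hot encoded targets."""
    one_hot = np.eye(probs.shape[1])[y]
    return -np.mean(np.sum(one_hot * np.log(probs), axis=1))

logits = np.array([[2.0, -1.0], [0.5, 0.3], [-1.2, 0.7]])   # hypothetical values
y = np.array([0, 1, 1])

probs = softmax(logits)
nll = negative_log_likelihood(probs, y)
ce = cross_entropy(probs, y)
print(nll, ce)
```

If the two agree, a rising validation error is more likely to come from the learning rate, the class imbalance, or the model capacity than from the choice between these two costs.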

Neural Network with softmax activation

edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to find local minima of the error in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain with softmax (or whether I can use gradient descent at all). After a bit of researching, I couldn't find the answer, so I decided to compute the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit: the derivative of softmax provided by the toolkit returned a square matrix of size nxn, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret the output matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is incorrectly done, but I have no idea how to fix this.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
\( E = - \sum^N_{n=1}\sum^K_{k=1} t_{kn} \log(y_{kn}) \)
where log is the natural logarithm, N depicts the number of training examples and K the number of classes (and thus units in the output layer). t_kn depicts the binary coding (0 or 1) of the k'th class in the n'th training example. y_kn the corresponding network output.
Showing that the gradient is correct might be a good exercise, I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
\( f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2) \)
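As a minimal sketch of such a check, the code below compares the claimed gradient of the cross-entropy error with respect to the softmax inputs, output - target, against a central-difference estimate (f(x + eps) - f(x - eps)) / (2 * eps). The five-class input `z` and one-hot target `t` are made-up values, and the helper names are chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def loss(z, t):
    """Cross-entropy error for a single example with one-hot target t."""
    return -np.sum(t * np.log(softmax(z)))

def numerical_gradient(f, z, eps=1e-6):
    """Central-difference estimate of the gradient of f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

z = np.array([0.2, -1.0, 0.5, 1.3, 0.0])   # hypothetical pre-softmax activations
t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # one-hot target

analytic = softmax(z) - t                   # output - target
numeric = numerical_gradient(lambda v: loss(v, t), z)
max_error = np.max(np.abs(analytic - numeric))
print(max_error)
```

A large discrepancy between the two would point to a bug in the hand-written backpropagation code.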
please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, /mydesire/neural folder has several softmax classifiers, some with softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced dynamic-system Simulation, Wiley 2007
GAK
look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
the softmax derivative is: dyi/dzi= yi * (1.0 - yi);
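That expression is only the diagonal of the full derivative. Since every softmax output depends on every input, the complete derivative is an n x n Jacobian with entries dyi/dzj = yi * (delta_ij - yj), which is consistent with a toolbox returning a square matrix whose diagonal matches yi * (1 - yi). A minimal NumPy sketch, with a made-up input vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """Full Jacobian: J[i, j] = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([0.2, -1.0, 0.5, 1.3, 0.0])   # hypothetical inputs for 5 classes
y = softmax(z)
J = softmax_jacobian(z)

print(np.diag(J))      # equals y * (1 - y)
print(J.sum(axis=0))   # each column sums to zero because the outputs sum to 1
```

The off-diagonal entries are what a purely element-wise derivative leaves out when the error is propagated back through the softmax layer.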