How to guarantee convergence when training a neural differential equation? - neural-network

I'm currently working through the SciML tutorials workshop exercises for the Julia language (https://tutorials.sciml.ai/html/exercises/01-workshop_exercises.html). Specifically, I'm stuck on exercise 6 part 3, which involves training a neural network to approximate the system of equations
function lotka_volterra(du,u,p,t)
x, y = u
α, β, δ, γ = p
du[1] = dx = α*x - β*x*y
du[2] = dy = -δ*y + γ*x*y
end
The goal is to replace the equation for du[2] with a neural network: du[2] = NN(u, p)
where NN is a neural net with parameters p and inputs u.
I have a set of sample data that the network should try to match. The loss function is the squared difference between the network model's output and that sample data.
I defined my network with
NN = Chain(Dense(2,30), Dense(30, 1)). I can get Flux.train! to run, but the problem is that sometimes the initial parameters for the neural network result in a loss on the order of 10^20 and so training never converges. My best try got the loss down from about 2000 initially to about 20 using the ADAM optimizer over about 1000 iterations, but I can't seem to do any better.
How can I make sure my network is consistently trainable, and is there a way to get better convergence?

How can I make sure my network is consistently trainable, and is there a way to get better convergence?
See the FAQ page on techniques for improving convergence. In a nutshell, the single shooting approach of most ML papers is very unstable and does not work on most practical problems, but there are a litany of techniques to help out. One of the best ones is multiple shooting, which optimizes only short bursts (in parallel) along the time series.
But training on a small interval and growing the interval works, also using more stable optimizers (BFGS) can work. You can also weigh the loss function so that earlier times mean more. Lastly, you can minibatch in a way similar to multiple shooting, i.e. start from a data point and only solve to the next (in fact, if you actually look at the original neural ODE paper NumPy code, they do not do the algorithm as explained but instead do this form of sampling to stabilize the spiral ODE training).

Related

How to deal with the randomness of NN training process?

Consider the training process of deep FF neural network using mini-batch gradient descent. As far as I understand, at each epoch of the training we have different random set of mini-batches. Then iterating over all mini batches and computing the gradients of the NN parameters we will get random gradients at each iteration and, therefore, random directions for the model parameters to minimize the cost function. Let's imagine we fixed the hyperparameters of the training algorithm and started the training process again and again, then we would end up with models, which completely differs from each other, because in those trainings the changes of model parameters were different.
1) Is it always the case when we use such random based training algorithms?
2) If it is so, where is the guaranty that training the NN one more time with the best hyperparameters found during the previous trainings and validations will yield us the best model again?
3) Is it possible to find such hyperparameters, which will always yield the best models?
Neural Network are solving a optimization problem, As long as it is computing a gradient in right direction but can be random, it doesn't hurt its objective to generalize over data. It can stuck in some local optima. But there are many good methods like Adam, RMSProp, momentum based etc, by which it can accomplish its objective.
Another reason, when you say mini-batch, there is at least some sample by which it can generalize over those sample, there can be fluctuation in the error rate, and but at least it can give us a local solution.
Even, at each random sampling, these mini-batch have different-2 sample, which helps in generalize well over the complete distribution.
For hyperparameter selection, you need to do tuning and validate result on unseen data, there is no straight forward method to choose these.

Linear vs nonlinear neural network?

I'm new to machine learning and neural networks. I know how to build a nonlinear classification model, but my current problem has a continuous output. I've been searching for information on neural network regression, but all I encounter is information on linear regression - nothing about nonlinear cases. Which is odd, because why would someone use neural networks to solve a simple linear regression anyway? Isn't that like killing a fly with a nuclear bomb?
So my question is this: what makes a neural network nonlinear? (Hidden layers? Nonlinear activation function?) Or do I have a completely wrong understanding of the word "linear" - can a linear regression NN accurately model datasets that are more complex than y=aX+b? Is the word "linear" used just as the opposite of "logistic"?
(I'm planning to use TensorFlow, but the TensorFlow Linear Model Tutorial uses a binary classification problem as an example, so that doesn't help me either.)
For starters, a neural network can model any function (not just linear functions) Have a look at this - http://neuralnetworksanddeeplearning.com/chap4.html.
A Neural Network has got non linear activation layers which is what gives the Neural Network a non linear element.
The function for relating the input and the output is decided by the neural network and the amount of training it gets. If you supply two variables having a linear relationship, then your network will learn this as long as you don't overfit. Similarly, a complex enough neural network can learn any function.
WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.
Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.
First let me point out that we can talk about linearity in parameters or linearity in the variables.
The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).
For example, notice that the following regression predicted values are considered linear predictions, despite non-linear transformations of the inputs because the output constitutes a linear combination of the parameters (although this model is non-linear in its variables):
Now for simplicity, let us consider a single neuron, single layer neural network:
If the transfer function is linear then:
As you have already probably noticed, this is a linear regression. Even if we were to add multiple inputs and neurons, each with a linear activation function, we would now only have an ensemble of regressions (all linear in their parameters and therefore this simple neural network is linear):
Now going back to (3), let's add two layers, so that we have a neural network with 3 layers, one neuron each (both with linear activation functions):
(first layer)
(second layer)
Now notice:
Reduces to:
Where and
Which means that our two layered network (each with a single neuron) is not linear in its parameters despite every activation function in the network being linear; however, it is still linear in the variables. Thus, once training has finished the model will be linear in both variables and parameters. Both of these are important because you cannot replicate this simple two layered network with a single regression and still capture all the effects of the model. Further, let me state clearly: if you use a model with multiple layers there is no guarantee that the output will be non-linear in it's variables (if you use a simple MLP perceptron and line activation functions your picture is still going to be a line).
That being said, let's take a look at the following statement from #Pawu regarding this answer:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
While you could argue that what #Pawu is saying is technically true, I think they are implying:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.
I would argue that this modified statement is wrong and can easily be demonstrated incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to using certain network architectures that you cannot introduce non-linearities without activation functions, but that is a arbitrary restriction and does not generalize to all network models.
Let me make this concrete. First take a simple xor problem. This is a basic classification problem where you are attempting to establish a boundary between data points in a configuration like so:
The kicker about this problem is that it is not linearly separable, meaning no single straight line will be able to perfectly classify. Now if you read anywhere on the internet I am sure they will say that this problem cannot be solved using only linear activation functions using a neural network (notice nothing is said about the architecture). This statement is only true in an extremely limited context and wrong generally.
Allow me to demonstrate. Below is a very simple hand written neural network. This network takes randomly generated weights between -1 and 1, an "xor_network" function which defines the architecture (notice no sigmoid, hardlims, etc. only linear transformations of the form mX or MX + B), and trains using standard backward propagation:
#%% Packages
import numpy as np
#%% Data
data = np.array([[0, 0, 0],[0, 1, 1],[1, 0, 1],[1, 1, 0]])
np.random.shuffle(data)
train_data = data[:,:2]
target_data = data[:,2]
#%% XOR architecture
class XOR_class():
def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
self.train_data = train_data
self.target_data = target_data
self.alpha = alpha
self.epochs = epochs
#Random weights
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
#xor network (linear transfer functions only)
def xor_network(self, X0):
n0 = np.dot(X0, self.W0) + self.b0
X1 = n0*X0
a = np.dot(X1, self.W2) + self.b2
return(a, X1)
#Training the xor network
def train(self):
for epoch in range(self.epochs):
for i in range(len(self.train_data)):
# Forward Propagation:
X0 = self.train_data[i]
a, X1 = self.xor_network(X0)
# Backward Propagation:
e = self.target_data[i] - a
s_2 = -2*e
# Update Weights:
self.W0 = self.W0 - (self.alpha*s_2*X0)
self.b0 = self.b0 - (self.alpha*s_2)
self.W2 = self.W2 - (self.alpha*s_2*X1)
self.b2 = self.b2 - (self.alpha*s_2)
#Restart training if we get lost in the parameter space.
if np.isnan(a) or (a > 1) or (a < -1):
print('Bad initialization, reinitializing.')
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
self.train()
#Predicting using the trained weights.
def predict(self, test_data):
for i in train_data:
a, X1 = self.xor_network(i)
#I cut off decimals past 12 for convienience, not necessary.
print(f'input: {i} - output: {np.round(a, 12)}')
Now let's take a look at the output:
#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:,:2]
xor.predict(test_data)
input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]
And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right classification with pure line activation functions, no sigmoid needed). . .
The only catch here is that I cut off all decimals past 12, but let's be honest 7.3 X 10^-16 is basically 0.
Now to be fair I am doing a little trick, where I am using the network connections to get the non-linear result, but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.
Thus the answer to your question, "what makes a neural network non-linear" is: non-linearity in the parameters or, obviously, non-linearity in the variables.
This non-linearity in the parameters/variables comes about two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.
For an example on non-linearity coming about through activation functions, suppose our input space, weights, and biases are all constrained such that they are all strictly positive (for simplicity). Now using (2) (single layer, single neuron) and the activation function , we have the following:
Which Reduces to:
Where , , and
Now, ignoring what issues this neural network has, it should be clear, that at the very least, it is non-linear in the parameters and variables and that non-linearity has been introduced solely by choice of the activation function.
Finally, yes neural networks can model complex data structures that cannot be modeled by using linear models (see xor example above).
EDIT:
As pointed out by #hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks nonlinear (because they are non-linear in the variables), but that the non-linearity introduced by them is degenerate without parameter non-linearity. For example, a single layered MLP with sigmoid activation functions will produce outputs that are non-linear in the variables in that the output is not proportional to the input, but in reality this is just an array of Generalized Linear Models. This should be especially obvious if we were to transform the targets by the appropriate link function, where now the activation functions would be linear. Now this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but that their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters we have a whole different beast than regression.
At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.
When it comes to nonlinear regression, this is referring to how the weights affect the output. If a function is not linear with respect to the weights, then your problem is a nonlinear regression problem. So for example, let's look at a Feedforward Neural Network with one hidden layer where the activation functions in the hidden layer are some function and the output layer has linear activation functions. Given this, the mathematical representation can be:
where we assume can operator on scalars and vectors with this notation to make it easy. , , , and are the weight you are aiming to estimate with the regression. If this was linear regression, would equal z, because that would make y linearly dependent on & . But if is nonlinear, say like , then now y is nonlinearly dependent on the weights .
Now provided you understand all that, I am surprised you haven't seen discussion of the nonlinear case because that's pretty much all people talk about in textbooks and research. The use of things like stochastic gradient descent, Nonlinear Conjugate Gradient, RProp, and other methods are to help find local minima (and hopefully good local minima) for these nonlinear regression problems, even though a global optimum is not typically guaranteed.
Any non-linearity from the input to output makes the network non-linear. In the way we usually think about and implement neural networks, those non-linearities come from activation functions.
If we are trying to fit non-linear data and only have linear activation functions, our best approximation to the non-linear data will be linear since that's all we can compute. You can see an example of a neural network trying to fit non-linear data with only linear activation functions here.
However, if we change the linear activation function to something non-linear like ReLu, then we can see a better non-linear fitting of the data. You can see that here.
I do not have enough reputation to comment on itwasthekix post, but I want to share my insight.
Someone asked in the comments whether equation 8 was linear, and the
answer was, that if w1 were to be varied when all else is constant
we would move up and down a non-linear function. This is not true.
When we vary w1, we essentially only change the output of z1 = (w1*p + b1). Since z1 is linearly transformed later, we will still
move an a linear function. If we were to fix everything except w1
AND w2, then we would move on a non-linear function.
If a multi-layer ANN is non-linear in parameters, because we have a
multiplication of parameters. That does not mean it can learn non-linear relationships.
The answer is very misleading and makes it sound, that we can
learn non-linear relationships using only linear transformations,
which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
If we take the gradient of w1*w2 and perform gradient descent, we only know the joint gradient, there is no way to determine the influence of the separate parameters without fixing one of them. And if we fix one, we move on a linear function.
If we add an (non-linear) activation function, we linearly transform a non-linear output enabling us to learn non-linear relationships, since we do not move on a linear function anymore.
Lets look at the case z = w2 * g(w1 * p + b1) + b2 assuming g is a non-linear activation function. Then if we fix everything and vary w1, we will move on a non-linear function, since w1 * p + b1 is transformed by g.
Non-linearity means different things in communities of regression analysis and neural network machine learning.
In regression analysis, when we say a fitting model is nonlinear, we mean that the model is nonlinear in terms of its parameters (not in terms of the independent variables).
A multiple-layer neural network is usually nonlinear in terms of the weights even the activation function is linear. This is simple to see because the information propagating in the network corresponds to function composition: f3(f2(f1())), which generally gives nonlinear functions of weights. Therefore, in terms of regression analysis, all neural networks are nonlinear models.
However in the community of neural network, people talk about the linearity in terms of input variables, rather than the weights/biases. Therefore, they define a neutral network with linear activation functions as linear and that with nonlinear activation function as nonlinear.
I had the same struggle, most online courses use ANNs for classification, but you never actually solve a regression problem with them in the courses.
What does make an ANN non-linear? The activation function.
Even if you have an ANN with thousands of perceptrons and hidden units, if all the activations are linear (or not activated at all) you are just training a plain linear regression.
But be careful, some activations functions (like sigmoid), have a range of values that act as a linear function and you may get stuck with a linear model even with non-linear activations.
How to predict continuous output with an ANN? The same way as when you classify.
It is the same problem, you just backpropagate the error (label - prediction) and update the weights. But don't forget to CHANGE THE ACTIVATION FUNCTION of the output layer to a continuous function (maybe ReLu if all labels are positive or don't activate the output at all), the intermediate hidden layers can be activated however you wish.
For small regression problems with ANNs you may need to start with a veeeeeery small learning rate since there will be lots of variance since the error will be "unbounded" at first.
Hope this helps :)
I don't want to be impolite, but the current answers are all related to nonlinear ND-polynomials resulting from linear activation functions. That simply doesn't make sense in terms of this question.
I get the point because you will have a polynomial as the objective function to minimize with coefficients that are products of layer coefficients and a product is nonlinear. Anyway, such a system will never be able to converge and doesn't make sense at all without any extra constraints.
The described system is not only completely unnecessarily nonlinear, but also ill-posed. Don't argue about stuff that leads ad absurdum. The original question actually completely nailed it.
Build a "linear neural network" with layers and try to use it as usual... then you will realise that this goes nowhere and you wasted your time.
So unless there is good reasons to believe this kind of ill-posed stuff has been handled I would never ever consider using a linear activation function. If you have extra constraints this might make sense. If you use stochastic gradient descent then you will at least skip some bad properties of it.
That the objective function is nonlinear in its parameters gives an impression that is wrong and bogus. And if the writer would have known about optimisation problems connected to terms with a product of coefficients he would have never written anything like this.
Any objective function can be made nonlinear. If you just replace one linear coefficient with a product of two coefficients. But that is nonsense because you can never determine those coefficients. NEVER. There are infinitely many solutions! And that doesn't even depend on the amount of data.
Because the activation is w*x, which is linear operation, so you need to have extra elements to make it non-linear.

Can't approximate simple multiplication function in neural network with 1 hidden layer

I just wanted to test how good can neural network approximate multiplication function (regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer
(I have tested 5 /30 /100 neurons per hidden layer), no normalization. And default parameters
Learning rate - 0.005, Number of learning iterations - 200, The initial learning weigh - 0.1,
The momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time boosted Decision forest regression shows very good approximation.
What am I doing wrong? This task should be very easy for NN.
Big multiplication function gradient forces the net probably almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Devide by constant. We are just deviding everything before the learning and multiply after.
2) Make log-normalization. It makes multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside it's range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])
model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict([[0.8, 0.5]])
It works.
"Two approaches: divide by constant, or make log normalization"
I'm tried both approaches. Certainly, log normalization works since as you rightly point out it forces an implementation of addition. Dividing by constant -- or similarly normalizing across any range -- seems not to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropogation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find the "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely until a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0

Matlab: Help in running toolbox for Kalman filter

I have AR(1) model with data samples $N=500$ that is driven by a random input sequence x. THe observation y is corrupted with measurement noise $v$ of zero mean. The model is
y(t) = 0.195y(t-1) + x(t) + v(t) where x(t) is generated as randn(). I am unsure how to represent this as a state space model and how to estimate the parameters $a$ and the states. I tried the state space representation would be
d(t) = \mathbf{a^T} d(t) + x(t)
y(t) = \mathbf{h^T}d(t) + sigma*v(t)
sigma =2.
I cannot understand how to perform parameter and state estimation. Using the toolbox mentioned below, I checked the Equations of KF to be matching with those in textbooks. However, the approach for parameter estimation is different. I will appreciate a recommendation for the implementation procedure.
Implementation 1:
I am following the implementation here : Learning Kalman Filter. This implementation does not use Expectation Maximization to estimate the parameters of AR model and it finds out the Covariance of the process noise. In my case, I don't have a process noise, but an input $x$.
Implementation 2: Kalman Filter by Kevin Murphy is another toolbox which uses EM for parameter estimation of AR model. Now, it is confusing since both the implementations uses different approach for parameter estimation.
I am having a tough time figuring out the correct approach, the state space model representation and the code. Shall appreciate recommendations on how to proceed.
I ran the first implementation for the KalmanARSquareRoot technique and the result is completely different. There is Exponential Moving Average Smoothing being performed and a MA filer of length 30 being used. The toolbox runs fine if I run the Demo examples. But on changing the model, the result is extremely poor. Maybe I am doing something wrong. Do I need to change the equations of KF for my time series?
In the second implementation, I cannot figure out what and where to change the Equations.
In general, if I have to use these tools, then do I need to change the KF equations for every time series model? How do I write the Equations myself if these toolboxes are inappropriate for all the time series model - AR, MA, ARMA?
I only have a bit of experience with Kalman Filters, so take this with a grain of salt.
It seems you shouldn't need to change the equations at all. Working with the second package (learn_kalman), you can create an A0 matrix of size [length(d(t)) length(d(t))]. C0 is the same, and in your case the initial state probably makes sense to be the Identity matrix (unlike your A0. All you need to do is choose a good initial condition.
However, I took a look at your system (plotted an example) and it seems noise dominates your system. KF is an optimal estimator but I have not known it to reject that much noise. It only guarantees a reduced covariance...which means that if your system is mostly dominated by noise, you will calculate a bad model that estimates your system given the noise!
Try plotting [d f] where d is the initial data and f is calculated using the regressive formula f(t) = C * A * f(t-1) :
f(t) = A * f(t-1)
; y(t) = C * f(t)
That is, pretend as if there is no noise but using the estimated AR model. You will see that it rejects all the noise and 'technically' models the system well (since the only unique behaviour is at the very beginning).
For example, if you have a system with A = 0.195, Q=R=0.l then you will converge to an A = 0.2207 but it still isn't good enough. Here the problem is that your initial state is so low, that within a few steps of data and you are essentially at 0 accounting for noise. Naturally KF can converge to a LOT of model solutions that are similar. Any noise will throw off even the best initial condition.
If you increase the resolution of your data in some way (e.g. larger initial condition, more refined timesteps) you will see a good fit. Ex, changing your initial condition to 110 and you'll find the two curves similar, though the model is still fairly different.
I am not aware of any approach to model your data well. If the noise variance is in fact 1 and your system converges to 0 that quickly, it seems doomed to not be effectively modelled since you just don't capture any unique behaviour in the dataset.

Conceptual issues on training neural network wih particle swarm optimization

I have a 4 Input and 3 Output Neural network trained by particle swarm optimization (PSO) with Mean square error (MSE) as the fitness function using the IRIS Database provided by MATLAB. The fitness function is evaluated 50 times. The experiment is to classify features. I have a few doubts
(1) Does the PSO iterations/generations = number of times the fitness function is evaluated?
(2) In many papers I have seen the training curve of MSE vs generations being plot. In the picture, the graph (a) on left side is a model similar to NN. It is a 4 input-0 hidden layer-3 output cognitive map. And graph (b) is a NN trained by the same PSO. The purpose of this paper was to show the effectiveness of the new model in (a) over NN.
But they mention that the experiment is conducted say Cycles = 100 times with Generations =300. In that case, the Training curve for (a) and (b) should have been MSE vs Cycles and not MSE vs PSO generations ? For ex, Cycle1 : PSO iteration 1-50 --> Result(Weights_1,Bias_1, MSE_1, Classification Rate_1). Cycle2: PSO iteration 1- 50 -->Result(Weights_2,Bias_2, MSE_2, Classification Rate_2) and so on for 100 Cycles. How come the X axis in (a),(b) is different and what do they mean?
(3) Lastly, for every independent run of the program (Running the m file several times independently, through the console) , I never get the same classification rate (CR) or the same set of weights. Concretely, when I first run the program I get W (Weights) values and CR =100%. When I again run the Matlab code program, I may get CR = 50% and another set of weights!! As shown below for an example,
%Run1 (PSO generaions 1-50)
>>PSO_NN.m
Correlation =
0
Classification rate = 25
FinalWeightsBias =
-0.1156 0.2487 2.2868 0.4460 0.3013 2.5761
%Run2 (PSO generaions 1-50)
>>PSO_NN.m
Correlation =
1
Classification rate = 100
%Run3 (PSO generaions 1-50)
>>PSO_NN.m
Correlation =
-0.1260
Classification rate = 37.5
FinalWeightsBias =
-0.1726 0.3468 0.6298 -0.0373 0.2954 -0.3254
What should be the correct method? So, which weight set should I finally take and how do I say that the network has been trained? I am aware that evolutionary algorithms due to their randomness will never give the same answer, but then how do I ensure that the network has been trained?
Shall be obliged for clarification.
As in most machine learning methods, the number of iterations in PSO is the number of times the solution is updated. In the case of PSO, this is the number of update rounds over all particles. The cost function here is evaluated after every particle is updated, so more than the number of iterations. Approximately, (# cost function calls) = (# iterations) * (# particles).
The graphs here are comparing different classifiers, fuzzy cognitive maps for graph (a) and a neural network for graph (b). So the X-axis displays the relevant measures of learning iterations for each.
Every time you run your NN you initialize it with different random values, so the results are never the same. The fact that the results vary greatly from one run to the next means you have a convergence problem. The first thing to do in this case is to try running more iterations. In general convergence is a rather complicated issue and the solutions vary greatly with applications (and read carefully through the answer and comments Isaac gave you on your other question). If your problem persists after increasing the number of iterations, you can post it as a new question, providing a sample of your data and the actual code you use to construct and train the network.