What weights should I use to start a NEAT based neural network? - neural-network

After reading a few papers on Neuro Evolution, more specifically NEAT, I realised that there was very little information regarding how you should weight each synapse at the start of the Neural Network. I understand that at the start, using NEAT, all the input neurons are connected to the output neuron, and then evolution takes place from there. However, should you weight each synapse randomly at the start, or simply set each one to 1?

It doesn't really matter a lot - it matters most how you mutate the weights of the connections in a genome.
However, setting the weights of each genome's connections to a random value is best: it acts like a small random search in the 'right' direction. If you'd set all the weights the same in for each genome, then weights in genomes will be extremely similar: keep in mind that a genome has a lot of connections, and with a mutation rate of 0.3 and two mutation options for example, only 15% of the population will have at least óne different weight after just 1 generation.
So make it something random, like random() * .2 - .1 (distribute between [-0.1, 0.1]). Just figure out what values work best for you.

Related

why is my Neural Network stuck at high loss value after the first epochs

I'm doing regression using Neural Networks. It should be a simple task for NN to do, I have 10 features and 1 output that I want to predict.I’m using pytorch for my project but my Model is not learning well. the loss start with a very high value (40000), then after the first 5-10 epochs the loss decrease rapidly to 6000-7000 and then it stuck there, no matter what I make. I tried even to change to skorch instead of pytorch so that I can use cross validation functionality but that also didn’t help. I tried different optimizers and added layers and neurons to the network but that didn’t help, it stuck at 6000 which is a very high loss value. I’m doing regression here, I have 10 features and I’m trying to predict one continuous value. that should be easy to do that’s why it is confusing me more.
here is my network:
I tried here all the possibilities from making more complex architectures like adding layers and units to batch normalization, changing activations etc.. nothing have worked
class BearingNetwork(nn.Module):
def __init__(self, n_features=X.shape[1], n_out=1):
super().__init__()
self.model = nn.Sequential(
nn.Linear(n_features, 512),
nn.BatchNorm1d(512),
nn.LeakyReLU(),
nn.Linear(512, 64),
nn.BatchNorm1d(64),
nn.LeakyReLU(),
nn.Linear(64, n_out),
# nn.LeakyReLU(),
# nn.Linear(256, 128),
# nn.LeakyReLU(),
# nn.Linear(128, 64),
# nn.LeakyReLU(),
# nn.Linear(64, n_out)
)
def forward(self, x):
out = self.model(x)
return out
and here are my settings:
using skorch is easier than pytorch. here I'm monitoring also the R2 metric and I made RMSE as a custom metric to also monitor the performance of my model. I also tried the amsgrad for Adam but that didn't help.
R2 = EpochScoring(r2_score, lower_is_better=False, name='R2')
explained_var_score = EpochScoring(EVS, lower_is_better=False, name='EVS Metric')
custom_score = make_scorer(RMSE)
rmse = EpochScoring(custom_score, lower_is_better=True, name='rmse')
bearing_nn = NeuralNetRegressor(
BearingNetwork,
criterion=nn.MSELoss,
optimizer=optim.Adam,
optimizer__amsgrad=True,
max_epochs=5000,
batch_size=128,
lr=0.001,
train_split=skorch.dataset.CVSplit(10),
callbacks=[R2, explained_var_score, rmse, Checkpoint(), EarlyStopping(patience=100)],
device=device
)
I also standardize the Input values.
my Input have the shape:
torch.Size([39006, 10])
and shape of output is:
torch.Size([39006, 1])
I’m using 128 as my Batch_size but I also tried other values like 32, 64, 512 and even 1024. Although normalizing output is not necessary but I also tried that and It didn’t work when I predict values, the loss is high. Please someone help me on this, I would appreciate every helpful advice. I ll also add a screenshot of my training and val losses and metrics over epochs to visualize how the loss is decreasing in the first 5 epochs and then it stays like forever at the value 6000 which is a very high value for a loss.
considering that your training and dev loss are decreasing over time, it seems like your model is training correctly. With respect to your worry regarding your training and dev loss values, this is entirely dependent on the scale of your target values (how big are your target values?) and the metric used to compute the training and dev losses. If your target values are big and you want smaller train and dev loss values, you can normalise the target values.
From what I gather with respect to your experiments as well as your R2 scores, it seems that you are looking for a solution in the wrong area. To me, it seems like your features aren't strong enough considering that your R2 scores are low, which could mean that you have a data quality issue. This would also explain why your architecture tuning has not improved your model's performance as it is not your model that is the issue. So if I were you, I would think about what new useful features I could add and see if that helps. In machine learning, the general rule is that models are only as good as the data that they are trained on. I hope this helps!
The metric you should be looking at is R^2, not the magnitude of the loss function. The purpose of a loss function is just to let the optimizer know if it's going in the right direction--it's not a measure of fit that's comparable across data sets and learning setups. That's what R^2 is for.
Your R^2 scores show that you're explaining around a third of the total variance in the output, which is often a very good result for a data set with only 10 features. Actually, given the shape of your data, it's more likely that your hidden layers are considerably larger than necessary and risk over fitting.
To really evaluate this model, you'd need to know (1) how the R^2 score compares to simpler regression approaches like OLS and (2) why you should have any confidence that more than 30% of the output variance should be captured by the input variables.
For #1, at least the R^2 shouldn't be worse. As for #2, consider the canonical digit categorization example. We know that all the information necessary to recognize digits with very high accuracy (i.e. R^2 approaching 1) because humans can do it. That's not necessarily the case with other data sets, because there are important sources of variance that aren't captured in the source data.
As your loss decreases from 40000 to 6000, that means your NN model has learnt the prevalent relation but not all of them. You can aid this learning by transforming the predictor variables and then feeding them as derived ones to your model and see if that helps. You can try performing step wise addition of features to your NN model, by adding the most influential predictors first. At every iteration evaluate the model performance (i.e. training loss).
If first step doesn't help and as you are open to other approaches, Presuming your data's dynamics, Gaussian process Regression or Quantile regression should help as these methods are free from assumptions like linear regression techniques. Also it should help to explore different aspects of relationship between your independent and dependent variable.

Neural Network - Working with a imbalanced dataset

I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4 , with label '0' appearing 250 times more than label '1'. However, I have a lot of training samples : around 23 millions. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether NN could be efficient to handle this kind of imbalanced dataset with a large dataset ?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha number. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model is predicting way too many 1 that it should be - around 0.50 of label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too as I did it manually for now : tf.nn.weighted_cross_entropy_with_logitsthat
2) I will try to de-unbalance the dataset but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~ 100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways for imbanlanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax+cross-entropy or Hinge Loss you can as #chasep255 mentionned make it more costly for the network to misclassify the example that appear the less.
To do that simply split the cost into two parts and put more weights on the class that have fewer examples.
For simplicity if you say that the dominant class is labelled negative (neg) for softmax and the other the positive (pos) (for Hinge you could exactly the same):
L=L_{neg}+L_{pos} =>L=L_{neg}+\alpha*L_{pos}
With \alpha greater than 1.
Which would translate in tensorflow for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0,1] to something like :
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
Whatismore by digging a bit into Tensorflow API you seem to have a tensorflow function tf.nn.weighted_cross_entropy_with_logitsthat implements it did not read the details but look fairly straightforward.
Another way if you train your algorithm with mini-batch SGD would be make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - neural network could help in your case. There are at least two approaches to such problem:
Leave your set not changed but decrease the size of batch and number of epochs. Apparently this might help better than keeping the batch size big. From my experience - in the beginning network is adjusting its weights to assign the most probable class to every example but after many epochs it will start to adjust itself to increase performance on all dataset. Using cross-entropy will give you additional information about probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during evaluation phase using Bayes rule:score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html

In what order should we tune hyperparameters in Neural Networks?

I have a quite simple ANN using Tensorflow and AdamOptimizer for a regression problem and I am now at the point to tune all the hyperparameters.
For now, I saw many different hyperparameters that I have to tune :
Learning rate : initial learning rate, learning rate decay
The AdamOptimizer needs 4 arguments (learning-rate, beta1, beta2, epsilon) so we need to tune them - at least epsilon
batch-size
nb of iterations
Lambda L2-regularization parameter
Number of neurons, number of layers
what kind of activation function for the hidden layers, for the output layer
dropout parameter
I have 2 questions :
1) Do you see any other hyperparameter I might have forgotten ?
2) For now, my tuning is quite "manual" and I am not sure I am not doing everything in a proper way.
Is there a special order to tune the parameters ? E.g learning rate first, then batch size, then ...
I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not independent ? Should we then tune them together ?
Is there any paper or article which talks about properly tuning all the parameters in a special order ?
EDIT :
Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve is completely weird for me... Because the cost decreases like way slowly that the others, but it got stuck at a lower accuracy rate. Is it possible that the model is stuck in a local minimum ?
Accuracy
Cost
For the learning rate, I used the decay :
LR(t) = LRI/sqrt(epoch)
Thanks for your help !
Paul
My general order is:
Batch size, as it will largely affect the training time of future experiments.
Architecture of the network:
Number of neurons in the network
Number of layers
Rest (dropout, L2 reg, etc.)
Dependencies:
I'd assume that the optimal values of
learning rate and batch size
learning rate and number of neurons
number of neurons and number of layers
strongly depend on each other. I am not an expert on that field though.
As for your hyperparameters:
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
You could also tune the initial weight values (see All you need is a good init). Although, the Xavier initializer seems to be a good way to prevent having to tune the weight inits.
I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
Get Tensorboard running. Plot the error there. You'll need to create subdirectories in the path where TB looks for the data to plot. I do that subdir creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
For parameters that are less important you can probably just pick a reasonable value and stick with it.
Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.
I don't know any tool specific for tensorflow, but the best strategy is to first start with the basic hyperparameters such as learning rate of 0.01, 0.001, weight_decay of 0.005, 0.0005. And then tune them. Doing it manually will take a lot of time, if you are using caffe, following is the best option that will take the hyperparameters from a set of input values and will give you the best set.
https://github.com/kuz/caffe-with-spearmint
for more information, you can follow this tutorial as well:
http://fastml.com/optimizing-hyperparams-with-hyperopt/
For number of layers, What I suggest you to do is first make smaller network and increase the data, and after you have sufficient data, increase the model complexity.
Before you begin:
Set batch size to maximal (or maximal power of 2) that works on your hardware. Simply increase it until you get a CUDA error (or system RAM usage > 90%).
Set regularizes to low values.
The architecture and exact numbers of neurons and layers - use known architectures as inspirations and adjust them to your specific performance requirements: more layers and neurons -> possibly a stronger, but slower model.
Then, if you want to do it one by one, I would go like this:
Tune learning rate in a wide range.
Tune other parameters of the optimizer.
Tune regularizes (dropout, L2 etc).
Fine tune learning rate - it's the most important hyper-parameter.

Backpropagation learning fails to converge

I use a neural network with 3 layers for categorization problem: 1) ~2k neurons 2) ~2k neurons 3) 20 neurons. My training set consists of 2 examples, most of the inputs in each example are zeros. For some reason after the backpropagation training the network gives virtually the same output for both examples (which is either valid for only 1 of examples or have 1.0 for outputs where one of example has 1s). It comes to this state after the first epoch and doesn't change much afterwards, even if learning rate is minimal double vale. I use sigmoid as activation function.
I thought it could be something wrong with my code so I've used AForge open source library, and seems like it suffers from the same issue.
What might be the problem here?
Solution: I've removed one layer and decreased the number of neurons in hidden layer to 800
2000 by 2000 by 20 is huge. That's approximately 4 million weights to determine, meaning the algorithm has to search a 4-million-dimensional space. Any optimization algorithm will be totally at a loss in this case. I'm assuming you're using gradient descent, which is not even that powerful, so likely the algorithm is stuck in a local optimum somewhere in this gigantic search space.
Simplify your model!
Added:
And please also describe in more detail what you're trying to do. Do you really have only 2 training examples? That's like trying to categorize 2 points using a 4-million-dimensional plane. It doesn't make sense to me.
You mentioned that most of the inputs are zero. To your reduce the size of your search space, try removing redundancy in your training examples. For instance if
trainingExample[0].inputValue[i] == trainingExample[1].inputValue[i]
then x.inputValue[i] has no information bearing data for the NN.
Also, perhaps it's not clear, but it seems that two training examples seem small.

Artificial Neural Networks: Choosing initial neurons

How is the initial structure (Neurons and connections between them) chosen? My book only states that we give the connection random weights in the beginning before we train the network.
I think that we would add neurons during the training like this:
Start with a completely empty network
The first value I generate during the training will not exist
Add a neuron to correspond to this value, with a random weight
What you are after is a self-organizing ANN. Usually, the way the connections are organized is man-made into a model that the developer thinks will have sufficient power to perform the computation neccessary. You can of course start with a random selection of nodes with random connections, but the evolution of such a network will probably take a lot longer time than a standard two or three layer network.
So, yes, you are right in that you would use a similar approach when doing a self-organizing network. Keep track of two sets of genetic algorithms, one for the structure and one for the weights (or combine the two in some devious way) and evolve as you please.
I do not believe the question is about self-organising or GA-evolved ANNs. It sounds more like it is about a the most common ANN: a perceptron (single or multi-layer), in which case the structure of the network: the number of layers and the size of the layers, must be hand chosen at the beginning. A simple initial rule of thumb for initialising the weight is simply picking uniformly random values between -1.0 and 1.0.