Mutate weights and biases in a neural network through a genetic algorithm

I have a genetic algorithm evolving a population of neural networks.
Until now I have mutated weights or biases by adding a value drawn with random.randn (Python), i.e. from a normal distribution with mean 0.
It works "well", and I managed to complete my project with it, but wouldn't it be better to use a uniform distribution over a given interval?
My intuition is that it would lead to more variety in my networks.
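For illustration, a minimal sketch of the two mutation operators being compared, using NumPy (the mutation rate, scale and interval below are arbitrary assumptions, not values from the question):

    import numpy as np

    rng = np.random.default_rng(0)

    def mutate_gaussian(weights, rate=0.1, scale=0.5):
        # Add N(0, scale) noise to a randomly chosen subset of the weights.
        mask = rng.random(weights.shape) < rate
        return weights + mask * rng.normal(0.0, scale, size=weights.shape)

    def mutate_uniform(weights, rate=0.1, low=-1.0, high=1.0):
        # Add Uniform(low, high) noise to a randomly chosen subset of the weights.
        mask = rng.random(weights.shape) < rate
        return weights + mask * rng.uniform(low, high, size=weights.shape)

    w = rng.standard_normal(10)   # e.g. one layer's weight vector, flattened
    print(mutate_gaussian(w))
    print(mutate_uniform(w))

Swapping one function for the other is all that changes between the two schemes, which makes the empirical comparison suggested in the answer below easy to run.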

I don't think this question has a simple answer. With a normal distribution, numbers close to the mean have a higher chance of being "selected" by your number generator, while a uniform distribution gives (almost) equal chance to all numbers in the interval. That much is clear, but whether an equal chance leads to better results can, in my view, only be settled empirically. So I suggest you run experiments with both the normal and the uniform distribution and judge based on the results.
About variety: I assume you have some vector that represents the weights, and at the mutation stage you add a random number to some of its elements. With a normal distribution centred at 0, that number will most likely come from a small interval around 0, so with high probability a mutation changes elements only a little. You get mostly small improvements to the vector, and only occasionally does something big show up. With a uniform distribution the changes are more spread out, which leads to more different individuals. Will those individuals be better? I don't know, but let me offer another view. I think of genetic algorithms as an analogy to the theory of evolution, and from that point of view cumulative small improvements, with a small probability of a big change, seem more appropriate. Think of the situation where you use a uniform distribution but the children have low fitness because of the big changes, so they are not selected when the new generation is created; you may then wait a long time for the one tiny improvement that makes your network perform well.
One more thing: your experiments may show that the uniform (or the normal) distribution is better, but such a result may only hold for your current problem, not in general.

Related

Poorly calibrated probabilities but good classification in confusion matrix

I have an imbalanced data set. My goal is to balance sensitivity and specificity via the confusion matrix. I used glmnet in R with class weights. The model does well at balancing sensitivity/specificity, but when I looked at the calibration plot, the probabilities are not well calibrated. I have read about calibrating probabilities, but I am wondering whether it matters if my goal is to produce class predictions. If it does matter, I have not found a way to calibrate the probabilities when using caret::train().
This topic has been widely discussed, especially in some answers by Stephan Kolassa. I will try to summarize the main take-home messages for your specific question.
From a purely statistical point of view, your interest should be in producing, as output, a probability for each class for any new data instance. Because you deal with imbalanced data, such probabilities can be small, which, as long as they are correct, is not an issue. Of course, some models give poor estimates of the class probabilities. In such cases, calibration lets you correct the probabilities obtained from a given model, so that whenever you estimate a probability p that a new observation belongs to the target class, p is indeed close to its true probability of being in that class.
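As an illustration of what checking and fixing calibration can look like, here is a sketch with scikit-learn rather than glmnet/caret (synthetic data; the class-weighted logistic regression merely stands in for your weighted glmnet model):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
    calibrated = CalibratedClassifierCV(weighted, method="isotonic", cv=5)

    for name, model in [("weighted", weighted), ("weighted + isotonic", calibrated)]:
        model.fit(X_tr, y_tr)
        frac_pos, mean_pred = calibration_curve(
            y_te, model.predict_proba(X_te)[:, 1], n_bins=10)
        # Well calibrated: the observed fraction of positives tracks the predicted probability.
        print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))

Class weights typically push the predicted probabilities away from the true ones; the calibration step maps them back.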
If you are able to obtain a good probability estimator, then balancing sensitivity and specificity is not part of the statistical side of your problem, but of the decision component. That final decision will likely rely on some kind of threshold. Depending on the costs of type I and type II errors, the cost-optimal threshold may change; an optimal decision might even involve more than one threshold.
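A toy sketch of that decision step: once the probabilities can be trusted, the threshold follows from the (here entirely made-up) misclassification costs rather than from rebalancing the data.

    import numpy as np

    # Hypothetical costs: assume a false negative is 10x as costly as a false positive.
    cost_fp, cost_fn = 1.0, 10.0

    # With calibrated probabilities p, predicting "positive" is cost-optimal whenever
    # p * cost_fn > (1 - p) * cost_fp, i.e. above this threshold:
    threshold = cost_fp / (cost_fp + cost_fn)

    p = np.array([0.02, 0.08, 0.15, 0.60])   # calibrated probabilities for new cases
    print(round(threshold, 3))               # ~0.091
    print(p > threshold)                     # the resulting class decisions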
Ultimately, you really have to be careful about what the end user of your model specifically needs, because that is what determines the best way of making decisions with it.

Shouldn't we take the average of the n models in cross-validation in linear regression?

I have a question regarding cross-validation with a linear regression model.
From my understanding, in cross-validation we split the data into (say) 10 folds, train on 9 folds and use the remaining fold for testing. We repeat this process until every fold has been tested exactly once.
When we train the model on 9 folds, shouldn't we get a different model each time (maybe only slightly different from the model created on the whole dataset)? I know that we take the average of all the n performances.
But what about the model? Shouldn't the resulting model also be the average of all the n models? I see that the resulting model is the same as the model created from the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and do not take the average of all the models), then what is the point of calculating the average performance of n different models (since they are trained on different folds of the data and are supposed to be different, right)?
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
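To make the situation in the question concrete, a minimal sketch with scikit-learn on synthetic data: 10 folds give 10 fitted linear models whose coefficients differ slightly, plus 10 held-out scores that get averaged, while the model actually kept is refit on all the data.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

    scores, coefs = [], []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold
        coefs.append(model.coef_)

    print(np.mean(scores))          # the averaged performance estimate
    print(np.std(coefs, axis=0))    # the 10 models do differ, but only slightly

    final_model = LinearRegression().fit(X, y)   # the model usually kept: refit on all data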
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is a slightly unfortunate name, because in machine learning "model selection" usually means choosing between families of predictors, which is itself something that can be done with cross-validation). Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameters and algorithm from all of the data.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is never larger than the average of their individual risks (by Jensen's inequality).
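To make the averaging option concrete, a sketch for the linear case (scikit-learn, synthetic data): for a linear model, averaging the K fold predictors reduces to averaging their coefficients and intercepts.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=5, noise=5, random_state=1)

    fold_models = [Ridge(alpha=1.0).fit(X[tr], y[tr])
                   for tr, _ in KFold(n_splits=5, shuffle=True, random_state=1).split(X)]

    # Averaging the predictors: for linear models this is just averaging the parameters.
    avg_coef = np.mean([m.coef_ for m in fold_models], axis=0)
    avg_intercept = np.mean([m.intercept_ for m in fold_models])

    def averaged_predict(X_new):
        return X_new @ avg_coef + avg_intercept

    print(averaged_predict(X[:3]))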
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation you made on the held-out set gives you an unbiased estimate of the risk of those very predictors that you obtained, and for these estimates the only source of uncertainty comes from estimating the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when you retrain, namely that the cross-validation risk is an estimate of the "expected risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data drawn from the same distribution, you should on average reach the same level of performance. But note that this holds only on average: when retraining on the whole data, the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you retrain.
(2) The hyperparameters were selected for exactly the number of data points used for learning in each fold. If you relearn from the whole dataset, the optimal value of the hyperparameters is, in theory and in practice, no longer the same, so with retraining you are really crossing your fingers and hoping that the hyperparameters you chose are still fine for the larger dataset.
If you used leave-one-out, there is obviously no concern, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the hyperparameters chosen for 20 points are not really the same as those for 25 points...
CONs: Well, intuitively you do not get the benefit of training on all of the data at once.
There is unfortunately very little thorough theory on this, but the following two papers, especially the second one, consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use (repeated) cross-validation (CV) to obtain a relatively stable performance estimate for a model, not to improve the model itself.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV, you obtain many estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all the (training) data. The reason for doing this many times (repeats with different partition splits) is to get a stable estimate of the performance, which lets you look at, e.g., the mean/median performance and its spread (this tells you how well the model usually performs and how likely it is that you were simply lucky/unlucky and will see better/worse results).
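A sketch of that workflow with scikit-learn (synthetic data; the two candidate models are just placeholders): repeated CV to compare candidates, then a final refit of the chosen one on all the training data.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    X, y = make_regression(n_samples=300, n_features=10, noise=15, random_state=0)
    cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)

    candidates = {"ols": LinearRegression(), "ridge": Ridge(alpha=10.0)}
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        print(name, scores.mean().round(3), scores.std().round(3))   # mean and spread

    best = Ridge(alpha=10.0).fit(X, y)   # chosen model, retrained on all the data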
Two more things:
Usually, using CV will improve your results in the end, simply because you end up with a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the outputs of multiple, possibly differently trained models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But for those, too, you want to use CV in the end to choose a reasonable model.
I like your thinking. You have essentially rediscovered the idea behind bagging (averaging models trained on different subsets of the data), of which Random Forest is the best-known example:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

Neural networks backpropagation

I have gone through neural networks and have understood the derivation of backpropagation almost perfectly (finally!). However, I have a small doubt.
We update all the weights simultaneously, so what guarantees that this leads to a smaller cost? If the weights were updated one by one, it would definitely lead to a smaller cost, and it would be similar to linear regression. But if we update all the weights simultaneously, might we not overshoot the minimum?
Also, do we update the biases the same way we update the weights, after the forward and backward pass for each training case?
Lastly, I have started reading about RNNs. What are some good resources for understanding BPTT in RNNs?
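To make the "simultaneous update" in the question concrete, a toy sketch in NumPy (a single linear layer with squared loss, not a full network): all weights and the bias are updated at once from gradients computed at the old parameter values, and a learning rate that is too large can indeed overshoot the minimum.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.3        # ground-truth linear relation

    w, b = np.zeros(3), 0.0
    lr = 0.1                                        # try a much larger lr to see overshooting

    for _ in range(200):
        err = X @ w + b - y                         # forward pass at the CURRENT parameters
        grad_w = X.T @ err / len(y)                 # all gradients computed before any update
        grad_b = err.mean()
        w, b = w - lr * grad_w, b - lr * grad_b     # then weights AND bias updated together

    print(w, b)                                     # close to [1.5, -2.0, 0.5] and 0.3

For a small enough learning rate, a full gradient step on a smooth loss does not increase the cost, which is why simultaneous updates work even though no single coordinate is optimized exactly.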
Yes, updating only one weight at a time could decrease the error value at every step, but such updates are usually infeasible in practical neural-network solutions. Most of today's architectures have on the order of 10^6 parameters, so one epoch per parameter would take enormously long. Moreover, because of the nature of backpropagation, you usually have to compute many intermediate derivatives in order to get the derivative with respect to a given parameter, so such an approach would waste a lot of computation.
But the phenomenon you mention was noticed a long time ago, and there are ways of dealing with it. The two most common issues connected with it are:
Covariate shift: the error and weight updates of a given layer depend strongly on the output of the previous layer, so when you update that layer, the inputs seen by the next layer change. The most common way to deal with this right now is batch normalization (a minimal sketch follows after this list).
Nonlinear functions vs. linear differentiation: it sounds unusual when you think about backpropagation, but the derivative is a linear operator, which can cause problems for gradient descent. The most counterintuitive example is that if you multiply your input by a constant, every derivative is also multiplied by the same constant. This can lead to a lot of problems, but most recent training methods do a good job of dealing with it.
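As a minimal sketch of the batch-normalization idea mentioned above (forward pass in training mode only, NumPy, no learned running statistics): each feature is normalized over the batch before being rescaled, so the next layer sees inputs with a stable mean and variance regardless of how the previous layer's outputs have shifted.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # Normalize each feature over the batch, then rescale and shift.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    rng = np.random.default_rng(0)
    activations = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # a "shifted" layer output
    out = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~1 std per feature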
About BPTT, I strongly recommend Geoffrey Hinton's course on ANNs, and especially this video.

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So, given the image below of the sorted weights, I'd only like to use the features whose weights lie above the upper or below the lower yellow line.
What I'm looking for is some kind of slope-change detection, so I can discard all the features up to the first/last increase/decrease of the slope coefficient.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like this, or anything I can use from SciPy?
Edit:
At the moment I'm using 1.8*positive.std() as the positive threshold and 1.8*negative.std() as the negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
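A minimal sketch of that heuristic as described (NumPy; the 1.8 factor and the split into positive and negative weights follow the question, the weights themselves are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(0.0, 1.0, size=500)      # made-up feature weights

    positive, negative = weights[weights > 0], weights[weights < 0]
    upper = 1.8 * positive.std()
    lower = -1.8 * negative.std()

    selected = weights[(weights > upper) | (weights < lower)]
    print(len(selected), "of", len(weights), "features kept")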
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible. If you are worried about heavier tails, then you may want to base your analysis on order statistics.
Since you've plotted the data, I'll assume you're willing to sort all of it. Let N be the number of data points in your sample, and let x[i] be the i-th value in the sorted list of values. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation which is more robust against outliers. This estimate of the std can be used as you indicated above. (The magic numbers are the fractions of the data that are less than [mean + 1 sigma] and [mean - 1 sigma], respectively.)
There are also conditions where just keeping the highest 10% and lowest 10% would be sensible as well; these cutoffs are easily computed if you have the sorted data on hand.
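A sketch of the order-statistic estimate above, together with the percentile cutoffs (NumPy; the 1.8 multiplier is carried over from the question's heuristic, the weights are made up):

    import numpy as np

    def robust_std(values):
        # Estimate the std from the 15.87% and 84.13% order statistics (outlier-resistant).
        x = np.sort(values)
        n = len(x)
        return 0.5 * (x[int(0.8413 * n)] - x[int(0.1587 * n)])

    rng = np.random.default_rng(0)
    weights = rng.normal(0.0, 1.0, size=1000)

    sigma = robust_std(weights)
    keep = (weights > 1.8 * sigma) | (weights < -1.8 * sigma)   # threshold as in the question

    # Alternative: simply keep the top and bottom 10% of the sorted weights.
    lo, hi = np.percentile(weights, [10, 90])
    keep_pct = (weights < lo) | (weights > hi)
    print(keep.sum(), keep_pct.sum())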
These are somewhat ad hoc approaches based on the content of your question. The general sense of what you're trying to do is (a form of) anomaly detection, and you can probably do a better job of it if you're careful in defining/estimating what the shape of the distribution is near the middle, so that you can tell when the features are getting anomalous.

Machine learning - training step

When you're using Haar-like features as training data for an AdaBoost algorithm, how do you build your data sets? Do you literally have to find thousands of positive and negative samples? There must be a more efficient way of doing this...
I'm trying to analyze images in MATLAB (not faces) and am relatively new to image processing.
Yes, you do need many positive and negative samples for training. This is especially true for AdaBoost, which works by repeatedly reweighting (or resampling) the training set. How many samples are enough is hard to say, but generally the more the better, because that increases the chance of your training set being representative.
Also, it seems to me that your quest for efficiency is misplaced. Training is done ahead of time, presumably offline. It is the efficiency of classifying unknown instances after training is done that people usually worry about.
Undoubtedly, more data means more information and usually better results, so you should include as much information as possible. However, one thing you do need to watch is the ratio of the positive set to the negative set. For logistic regression the ratio should not go beyond about 1:5; for AdaBoost I'm not really sure of the exact effect, but the results will certainly change with the ratio (I have tried this before).
Yes, we need many positive and negative samples for training, but collecting that data is very tedious. You can make it easier by recording videos instead of taking pictures and using ffmpeg to convert the videos into frames. That makes building the training set much easier.
The only reason to have roughly equal numbers of positive and negative samples is to avoid bias. Sometimes you might get high accuracy even though the classifier completely fails on one category. To evaluate such methods, precision/recall are more useful than accuracy.
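A small sketch of that evaluation point (scikit-learn, made-up labels): a classifier that ignores the minority class can still show high accuracy, which precision and recall expose immediately.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = np.array([0] * 95 + [1] * 5)      # imbalanced ground truth
    y_pred = np.zeros(100, dtype=int)          # a classifier that always predicts "negative"

    print(accuracy_score(y_true, y_pred))                      # 0.95 - looks great
    print(precision_score(y_true, y_pred, zero_division=0))    # 0.0 - never finds a positive
    print(recall_score(y_true, y_pred))                        # 0.0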