Normalization before data split in Neural Network

I am trying to run an MLP regressor with one hidden layer on my dataset. I am standardizing my data, but I want to be clear about whether it matters if I do the standardization before or after splitting the dataset into training and test sets. Will there be any difference in my prediction values if I carry out standardization before the split?

Yes and no. If the mean and variance of the training and test sets are different, standardization can lead to a different outcome.
That being said, a good training/test split should leave the data points distributed in a similar way in both sets, in which case post-split standardization should give roughly the same results.

You should absolutely do it before splitting.
Imagine having [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as your inputs, split into [1, 2, 3, 4, 5, 7, 9, 10] for training and [6, 8] for testing.
It's immediately clear that the min-max ranges, as well as the mean and standard deviation, of the two samples are completely different, so by applying standardization "post-split" you are scrambling the relationship between the values in the first and the second set.
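A minimal sketch with scikit-learn's StandardScaler, using the toy numbers above, shows how the fitted statistics diverge between the two approaches:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.arange(1, 11, dtype=float).reshape(-1, 1)                      # [1..10]
train = np.array([1, 2, 3, 4, 5, 7, 9, 10], dtype=float).reshape(-1, 1)
test = np.array([6, 8], dtype=float).reshape(-1, 1)

# Pre-split: one scaler fitted on all the data, applied to both sets.
pre = StandardScaler().fit(data)
print(pre.mean_, pre.scale_)              # [5.5], [~2.87] for everything

# Post-split: each set gets its own mean/std, so the scales disagree.
print(StandardScaler().fit(train).mean_)  # [5.125]
print(StandardScaler().fit(test).mean_)   # [7.0]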

Related

Why is my Neural Network stuck at a high loss value after the first epochs?

I'm doing regression using neural networks. It should be a simple task for an NN: I have 10 features and 1 output that I want to predict. I'm using PyTorch for my project, but my model is not learning well. The loss starts at a very high value (40000), then after the first 5-10 epochs it decreases rapidly to 6000-7000, and then it gets stuck there, no matter what I do. I even switched from PyTorch to skorch so that I could use its cross-validation functionality, but that didn't help either. I tried different optimizers and added layers and neurons to the network, but it stays stuck at 6000, which is a very high loss value. I'm doing regression here, with 10 features and one continuous value to predict. That should be easy, which is why it confuses me even more.
Here is my network:
I tried everything here, from more complex architectures (adding layers and units) to batch normalization and changing activations; nothing has worked.
class BearingNetwork(nn.Module):
    def __init__(self, n_features=X.shape[1], n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(),
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.LeakyReLU(),
            nn.Linear(64, n_out),
            # nn.LeakyReLU(),
            # nn.Linear(256, 128),
            # nn.LeakyReLU(),
            # nn.Linear(128, 64),
            # nn.LeakyReLU(),
            # nn.Linear(64, n_out)
        )

    def forward(self, x):
        out = self.model(x)
        return out
and here are my settings:
Using skorch is easier than plain PyTorch. Here I'm also monitoring the R2 metric, and I made RMSE a custom metric to monitor the performance of my model. I also tried amsgrad for Adam, but that didn't help.
R2 = EpochScoring(r2_score, lower_is_better=False, name='R2')
explained_var_score = EpochScoring(EVS, lower_is_better=False, name='EVS Metric')
custom_score = make_scorer(RMSE)
rmse = EpochScoring(custom_score, lower_is_better=True, name='rmse')

bearing_nn = NeuralNetRegressor(
    BearingNetwork,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    optimizer__amsgrad=True,
    max_epochs=5000,
    batch_size=128,
    lr=0.001,
    train_split=skorch.dataset.CVSplit(10),
    callbacks=[R2, explained_var_score, rmse, Checkpoint(), EarlyStopping(patience=100)],
    device=device
)
I also standardize the input values.
My input has the shape:
torch.Size([39006, 10])
and the shape of the output is:
torch.Size([39006, 1])
I'm using 128 as my batch size, but I also tried other values like 32, 64, 512, and even 1024. Although normalizing the output is not necessary, I tried that too, and it didn't work: when I predict values, the loss is still high. Please help me with this; I would appreciate any helpful advice. I'll also add a screenshot of my training and validation losses and metrics over epochs, to visualize how the loss decreases in the first 5 epochs and then stays forever at around 6000, which is a very high value for a loss.
Considering that your training and dev losses are decreasing over time, it seems like your model is training correctly. As for your worry about the training and dev loss values, this depends entirely on the scale of your target values (how big are your target values?) and on the metric used to compute the losses. If your target values are large and you want smaller train and dev loss values, you can normalise the target values.
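As a hedged sketch of what normalising the targets could look like (the array below is just a stand-in for the question's (39006, 1) output):

import numpy as np

# Hypothetical targets standing in for the real (39006, 1) output.
y = np.random.randn(39006, 1) * 200.0 + 1000.0

y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std   # train the regressor on y_scaled instead of y

# At prediction time, map the network's output back to the original scale:
# y_pred = y_pred_scaled * y_std + y_mean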
From what I gather from your experiments as well as your R2 scores, it seems that you are looking for a solution in the wrong area. To me, it seems like your features aren't strong enough, considering that your R2 scores are low, which could mean that you have a data quality issue. This would also explain why tuning your architecture has not improved your model's performance: it is not the model that is the issue. So if I were you, I would think about what new useful features I could add and see if that helps. In machine learning, the general rule is that models are only as good as the data they are trained on. I hope this helps!
The metric you should be looking at is R^2, not the magnitude of the loss function. The purpose of a loss function is just to let the optimizer know whether it's going in the right direction; it's not a measure of fit that's comparable across data sets and learning setups. That's what R^2 is for.
Your R^2 scores show that you're explaining around a third of the total variance in the output, which is often a very good result for a data set with only 10 features. Actually, given the shape of your data, it's more likely that your hidden layers are considerably larger than necessary and risk overfitting.
To really evaluate this model, you'd need to know (1) how the R^2 score compares to simpler regression approaches like OLS and (2) why you should have any confidence that more than 30% of the output variance should be captured by the input variables.
For #1, at a minimum the R^2 shouldn't be worse. As for #2, consider the canonical digit-classification example. We know that all the information necessary to recognize digits with very high accuracy (i.e. R^2 approaching 1) is present in the inputs, because humans can do it. That's not necessarily the case with other data sets, where important sources of variance aren't captured in the source data.
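For point #1, a minimal sketch of the OLS baseline check with scikit-learn (the data here is synthetic; you would plug in your own X and y):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                       # stand-in for the real inputs
y = X @ rng.normal(size=10) + rng.normal(size=1000)   # stand-in for the real targets

ols_r2 = cross_val_score(LinearRegression(), X, y, cv=10, scoring='r2')
print(ols_r2.mean())   # the network's R^2 should be at least this good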
Since your loss decreases from 40000 to 6000, your NN model has learned the dominant relationships, but not all of them. You can aid this learning by transforming the predictor variables and feeding the derived features to your model, and see if that helps. You can also try stepwise addition of features to your NN model, adding the most influential predictors first; at every iteration, evaluate the model performance (i.e. the training loss), as in the sketch below.
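A hedged sketch of that stepwise procedure, using a plain linear model as a cheap evaluator on synthetic data; swapping in the actual network and its loss would follow the same loop:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                         # placeholder features
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=500)  # placeholder target

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Add the feature whose inclusion gives the lowest cross-validated MSE.
    mse = {j: -cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                               cv=5, scoring='neg_mean_squared_error').mean()
           for j in remaining}
    best = min(mse, key=mse.get)
    selected.append(best)
    remaining.remove(best)
    print(selected, round(mse[best], 3))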
If the first step doesn't help, and since you are open to other approaches: depending on your data's dynamics, Gaussian process regression or quantile regression may help, as these methods are free from the assumptions that linear regression techniques make. They should also help you explore different aspects of the relationship between your independent and dependent variables.

Shouldn't we take the average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been tested exactly once.
When we train the model on 9 folds, shouldn't we get a different model (maybe slightly different from the model we created using the whole dataset)? I know that we take an average of all n performances.
But what about the model? Shouldn't the resulting model also be taken as the average of all n models? I see that the resulting model is the same as the model we created using the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and don't average the models), then what's the point of calculating the average performance over n different models (they are trained on different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is unfortunate terminology, because in machine learning we also do model selection, i.e. choosing between families of predictors, and that is something which can itself be done using cross-validation). Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameters and algorithm from all the data.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is never larger than the average of the individual risks (by Jensen's inequality: R((1/K) * sum_k f_k) <= (1/K) * sum_k R(f_k)).
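Concretely, a minimal sketch of keeping and averaging the K per-fold predictors (scikit-learn, synthetic data):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

# One predictor per fold, each trained on its 9/10 of the data.
fold_models = [Ridge(alpha=1.0).fit(X[tr], y[tr])
               for tr, _ in KFold(n_splits=10).split(X)]

# The aggregated predictor: average the K predictions (valid for convex losses).
X_new = rng.normal(size=(3, 5))
y_avg = np.mean([m.predict(X_new) for m in fold_models], axis=0)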
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation you made on the held-out set gives you an unbiased estimate of the risk of the very predictor you obtained, and for these estimates the only source of uncertainty is the estimate of the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when retraining, which is that the cross-validation risk is an estimate of the "expected risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data from the same distribution, you should on average have the same level of performance. But note that this holds only on average; when retraining from the whole data, the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected for exactly the number of data points you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameters is, in theory and in practice, no longer the same, so with retraining you are really crossing your fingers and hoping that the hyperparameters you chose are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively, you don't benefit from training with all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers (especially the second) consider precisely the averaging or aggregation of the predictors from K-fold CV:
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model, rather than to improve it.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV, you obtain many different estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different splits) is to get a stable estimate of the performance, which lets you e.g. look at the mean/median performance and its spread (telling you how well the model usually performs and how likely it is to get better/worse results simply by being lucky/unlucky).
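A minimal sketch of that workflow (the two candidate models here are just examples):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(size=300)

# Repeated CV: many splits give a stable estimate (mean and spread) per candidate.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
for model in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    print(type(model).__name__, scores.mean(), scores.std())

# Then retrain the chosen candidate on all the training data.
final_model = Ridge(alpha=1.0).fit(X, y)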
Two more things:
Usually, using CV will improve your results in the end, simply because you end up with a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained, models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But even then, you want to use CV in the end to choose a reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
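For instance (a minimal sketch on synthetic data), a random forest is literally an average over many trees, each trained on a different bootstrap resample:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_))   # 100 individual trees whose predictions are averaged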
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

Can a neural network be trained to recognize abstract pattern forms?

I'm curious about the kinds of limitations even an expertly designed network might have. This one in particular is what I could use some insight on:
given:
a set of random integers of non-trivial size (say at least 500)
an expertly created/trained neural network.
task:
Number anagram: create the largest representation of an infinite sequence of integers possible in a given time frame, where the sequence either can be represented in closed form (i.e. n^2, 2x+5, etc.) or is registered in OEIS (http://oeis.org/). The numbers used to create the sequence can be taken from the input set in any order. So if the network is fed (3, 5, 1, 7, ...), returning (1, 3, 5, 7, ...) would be an acceptable result.
It's my understanding that an ANN can be trained to look for a particular sequence pattern (again: n^2, 2x+5, etc.). What I'm wondering is whether it can be made to recognize a more general pattern like n^y or xy+z. My thinking is that it won't be able to, because n^y can produce sequences that look different enough from one another that a stable 'base pattern' can't be established. That is, intrinsic to the way ANNs work (taking sets of input and doing fuzzy matching against a static pattern they have been trained to look for) is that they are limited in the scope of what they can be trained to look for.
Have I got this right?
Continuing from the conversation I had with you in the comments:
Neural networks still might be useful. Instead of training a neural net to search for a single pattern, the neural net can be trained to predict the data. If the data contains a predictable pattern, the NN can learn it, and the weights of the NN will represent the pattern it has learned. I think that may be what you were intending to do.
Some things that might be helpful for you if you do this:
Autoencoders do unsupervised learning and can learn the structure of individual datapoints.
Recurrent Neural Networks can model sequences of data rather than just individual datapoints. This sounds more like what you are looking for (see the sketch after this list).
A Compositional Pattern-Producing Network (CPPN) is a really fancy term for a neural network with mathematical functions as activation functions. This would allow you to model functions that aren't easily approximated by NNs with simple activation functions like sigmoids or ReLU. But usually this isn't necessary, so don't worry too much about it until after you have a simple NN working.
Dropout is a simple technique where you randomly drop a fraction (often half) of the hidden units on every iteration. This seems to seriously reduce overfitting. It prevents complicated relationships between neurons from forming, which should make the models more interpretable, which seems like your goal.
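As a hedged illustration of the RNN route mentioned above, here is a tiny PyTorch sketch that learns to predict the next element of a sequence (n^2 is just an example choice):

import torch
import torch.nn as nn

# The sequence 1, 4, 9, ..., scaled to [0, 1] to keep training stable.
seq = torch.tensor([float(n * n) for n in range(1, 21)])
seq = seq / seq.max()
x = seq[:-1].view(1, -1, 1)   # (batch, time, features)
y = seq[1:].view(1, -1, 1)    # next-step targets

rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

for _ in range(500):
    out, _ = rnn(x)           # hidden states for every time step
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()
    opt.step()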

Why do we need cross validation in the multiSVM method for image classification?

I am new to image classification. I am currently working on the SVM (Support Vector Machine) method for classifying four groups of images with a multisvm function. In my algorithm, the training and testing data are randomly selected each time, and the performance varies from run to run. Someone suggested doing cross validation, but I did not understand why we need cross validation or what its main purpose is. My actual data set consists of a training matrix of size 28×40000 and a testing matrix of size 17×40000. How do I do cross validation with this data set? Thanks in advance.
Cross validation is used to select your model. The out-of-sample error can be estimated from your validation error; as a result, you want to select the model with the least validation error. Here, the model refers to the features you want to use and, more importantly, to the gamma and C of your SVM. After cross validation, you use the selected gamma and C, the ones with the least average validation error, to train on the whole training data.
You may also need to estimate the performance of your features and parameters to avoid both high bias and high variance. Whether your model suffers from underfitting or overfitting can be observed from the in-sample error and the validation error.
Typically, 10-fold cross validation is used.
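A minimal sketch of that gamma/C selection with scikit-learn's GridSearchCV (the synthetic four-class data below just stands in for your image features):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100],
                                'gamma': [1e-3, 1e-2, 1e-1]},
                    cv=10)
grid.fit(X, y)   # refit=True (the default) retrains the best model on all data
print(grid.best_params_)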
I'm not familiar with multiSVM, but you may want to check out libSVM; it is a popular, free SVM library with support for a number of different programming languages.
Here they describe cross validation briefly. It is a way to avoid overfitting the model by breaking the training data up into subgroups. In this way you can find a model (defined by a set of parameters) that fits both subgroups optimally.
For example, in the following picture they plot the validation accuracy contours for the gamma and C values that parameterize the model. From this contour plot you can tell that the heuristically optimal values (of those tested) are the ones that give an accuracy closer to 84 rather than 81.
Refer to this link for more detailed information on cross-validation.
You always need to cross-validate your experiments in order to guarantee a correct scientific approach. For instance, if you don't cross-validate, the results you report (such as accuracy) might be highly biased by your particular test set. In an extreme case, your training step might have been very weak (in terms of fitting the data) while your test step happened to go very well. This applies to ALL machine learning and optimization experiments, not only SVMs.
To avoid such problems, just divide your initial dataset in two (for instance), train on the first set and test on the second, then repeat the process the other way around, training on the second and testing on the first. This guarantees that any biases in the data become visible to you. As someone suggested, you can take the division further: 10-fold cross-validation means dividing your data set into 10 parts, then training on 9 and testing on 1, repeating the process until you have tested on every part, as in the sketch below.
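A hedged sketch of that 10-fold procedure, written out explicitly (the arrays below are synthetic stand-ins for the 28×40000 and 17×40000 matrices in the question):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(280, 50))    # stand-in for the image feature matrix
y = rng.integers(0, 4, size=280)  # four image groups

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel='rbf').fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(accuracies), np.std(accuracies))  # stable estimate across all folds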

Accuracy of Neural Network Output - Matlab ANN Toolbox

I'm relatively new to the Matlab ANN Toolbox. I am training the NN for pattern recognition with a target matrix of 3x8670 containing 1s and 0s, using one hidden layer, 40 neurons, and the rest with default settings. When I get the simulated output for a new set of inputs, the values are around 0 and 1. I then arrange them in descending order and choose a fixed number (which is known to me) out of the 8670 observations to be 1, and the rest to be zero.
Every time I run the program, the first row of the simulated output always has close to 100% accuracy, but the following rows don't exhibit the same kind of accuracy.
Is there a logical explanation in general? I understand that answering this query conclusively might require an understanding of the program and the problem, but it is made up of several functions and is hard to explain clearly. Can I make some changes to the training to get consistent output?
If you have any suggestions, please share them with me.
Thanks,
Nishant
Your problem statement is not clear to me. For example, what do you mean by: "I then arrange them in descending order and choose a fixed number ..."?
As I understand it, you did not get appropriate output from your NN compared to the real target; that is, your output from the NN differs from the target. If so, there are different possibilities to consider:
How do you divide the training/test/validation sets for the training phase? The largest share should be assigned to training (around 75%), with the rest for test/validation.
What does your training data set look like? Does it cover most of the scenarios you expect? If your training data is not sufficiently similar to your test data (e.g., the test set contains new records/samples that did not, or hardly ever, appear in the training phase), those samples act as outliers, and an NN cannot work efficiently with that kind of sample; you would need a clustering approach rather than an NN classification approach. In that case the NN's results will be out of range, and it cannot provide the accuracy you need. An NN is good for training on data sets where there is no big difference between training and test data; otherwise, an NN is not appropriate.
Sometimes you have an appropriate training data set, but the problem is the training itself. In this case, you need other types of NN, because feed-forward NNs such as MLPs do not work well with compact, poorly separated regions of data. You need a stronger function approximator, such as RBF or SVM.