I've trained an xgboost classifier (on train_df), tuned it (on valid_df), and tested it (on test_df). Some non-trivial observations follow. After running HyperOpt trials, I obtain the following precision scores:
Model 1: train: 0.16, valid: 0.12, test: 0.12
Model 2: train: 0.23, valid: 0.18, test: 0.17
Before rushing to conclude that these two models are overfitting, can somebody tell me what's happening here and what can be done to resolve it? If this is indeed overfitting, is it something to worry about? I am getting consistent results on valid_df and test_df. In addition, on an out-of-sample (i.e., unseen-data) evaluation, I get performance consistent with the test_df performance. Since we care only about the test_df metric, as it indicates the real performance of the model, do we need to worry about how much the model is overfitting (if this is overfitting) on train_df? At the end of the day, test_df performance is all that matters.
Related question:
I am working on an anomaly detection/classification problem.
I trained a HistGradientBoostingClassifier model in sklearn.
The dataset is imbalanced, so I used the F1 score as the metric to validate model performance.
The model seems to perform well during the fitting process with GridSearchCV, and it performed well on the test set too.
However, when I tested it on a new dataset, the model's performance was very bad.
So I have a few questions:
In the first image, you can see that the train loss is much lower than the validation loss. Is this an indication of overfitting?
If it is overfitting, why does it perform well on the test data (F1 score of about 0.9)?
Why does it perform so badly on new data (F1 score of about 0.06 in the second image)?
What should be my next step to tackle this problem ?
I think you should try SMOTE or SMOTETomek on your training data before fitting.
The SMOTE and SMOTETomek algorithms are available in imbalanced-learn (see the SMOTE documentation).
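For illustration, here is a minimal sketch of resampling the training data before fitting; the data is synthetic and HistGradientBoostingClassifier simply stands in for whichever classifier you use:

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic imbalanced data standing in for the real training set.
X_train, y_train = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Oversample the minority class on the training data only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
# Alternatively, combine oversampling with Tomek-link cleaning:
# X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_train, y_train)

clf = HistGradientBoostingClassifier().fit(X_res, y_res)

Note that the resampling should never be applied to the validation or test data.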
I'm doing regression using neural networks. It should be a simple task for a NN: I have 10 features and 1 output that I want to predict. I'm using pytorch for my project, but my model is not learning well. The loss starts at a very high value (40000), then after the first 5-10 epochs it decreases rapidly to 6000-7000 and gets stuck there, no matter what I do. I even tried switching to skorch instead of plain pytorch so that I could use its cross-validation functionality, but that didn't help. I tried different optimizers and added layers and neurons to the network, but it stays stuck at 6000, which is a very high loss value. I'm doing regression here with 10 features, trying to predict one continuous value; that should be easy to do, which is why it confuses me even more.
Here is my network. I have tried all the possibilities, from more complex architectures (adding layers and units) to batch normalization and changing activations; nothing has worked:
class BearingNetwork(nn.Module):
    def __init__(self, n_features=X.shape[1], n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(),
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.LeakyReLU(),
            nn.Linear(64, n_out),
            # nn.LeakyReLU(),
            # nn.Linear(256, 128),
            # nn.LeakyReLU(),
            # nn.Linear(128, 64),
            # nn.LeakyReLU(),
            # nn.Linear(64, n_out)
        )

    def forward(self, x):
        out = self.model(x)
        return out
and here are my settings:
Using skorch is easier than plain pytorch. Here I'm also monitoring the R2 metric, and I made RMSE a custom metric to monitor the model's performance as well. I also tried amsgrad for Adam, but that didn't help.
R2 = EpochScoring(r2_score, lower_is_better=False, name='R2')
explained_var_score = EpochScoring(EVS, lower_is_better=False, name='EVS Metric')
custom_score = make_scorer(RMSE)
rmse = EpochScoring(custom_score, lower_is_better=True, name='rmse')
bearing_nn = NeuralNetRegressor(
    BearingNetwork,
    criterion=nn.MSELoss,
    optimizer=optim.Adam,
    optimizer__amsgrad=True,
    max_epochs=5000,
    batch_size=128,
    lr=0.001,
    train_split=skorch.dataset.CVSplit(10),
    callbacks=[R2, explained_var_score, rmse, Checkpoint(), EarlyStopping(patience=100)],
    device=device,
)
I also standardize the input values.
My input has the shape:
torch.Size([39006, 10])
and the shape of the output is:
torch.Size([39006, 1])
I'm using 128 as my batch size, but I also tried other values like 32, 64, 512, and even 1024. Normalizing the output is not strictly necessary, but I tried that too and it didn't work either; when I predict values, the loss is still high. Please, someone help me with this; I would appreciate any helpful advice. I'll also add a screenshot of my training and validation losses and metrics over epochs, to visualize how the loss decreases in the first 5 epochs and then stays forever at around 6000, which is a very high value for a loss.
Considering that your training and dev loss are decreasing over time, it seems like your model is training correctly. With respect to your worry about the training and dev loss values, this is entirely dependent on the scale of your target values (how big are they?) and the metric used to compute the training and dev losses. If your target values are big and you want smaller train and dev loss values, you can normalise the target values.
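For illustration only, here is a minimal sketch of standardising the target before training; net, X, and X_new are placeholders for the skorch regressor and data from the question, and the target values here are made up:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder target with roughly the shape from the question; the real y would be used instead.
y = (np.random.rand(39006, 1) * 300.0).astype(np.float32)

y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y)            # zero mean, unit variance

# net.fit(X, y_scaled)                                      # train on the scaled target
# preds = y_scaler.inverse_transform(net.predict(X_new))    # map predictions back to original units

Training on the scaled target brings the MSE loss down to a unit scale, which makes its magnitude easier to interpret.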
From what I gather with respect to your experiments as well as your R2 scores, it seems that you are looking for a solution in the wrong area. To me, it seems like your features aren't strong enough considering that your R2 scores are low, which could mean that you have a data quality issue. This would also explain why your architecture tuning has not improved your model's performance as it is not your model that is the issue. So if I were you, I would think about what new useful features I could add and see if that helps. In machine learning, the general rule is that models are only as good as the data that they are trained on. I hope this helps!
The metric you should be looking at is R^2, not the magnitude of the loss function. The purpose of a loss function is just to let the optimizer know if it's going in the right direction--it's not a measure of fit that's comparable across data sets and learning setups. That's what R^2 is for.
Your R^2 scores show that you're explaining around a third of the total variance in the output, which is often a very good result for a data set with only 10 features. Actually, given the shape of your data, it's more likely that your hidden layers are considerably larger than necessary and risk overfitting.
To really evaluate this model, you'd need to know (1) how the R^2 score compares to simpler regression approaches like OLS and (2) why you should have any confidence that more than 30% of the output variance should be captured by the input variables.
For #1, the R^2 should at least not be worse. As for #2, consider the canonical digit-classification example. We know that all the information necessary to recognize digits with very high accuracy (i.e., R^2 approaching 1) is present in the inputs, because humans can do it. That's not necessarily the case with other data sets, where there are important sources of variance that aren't captured in the source data.
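As a rough sketch of the OLS baseline in point #1 (assuming X and y are the 10-feature inputs and continuous target from the question):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out 20% of the data and compare the held-out R^2 of plain OLS with the NN's R^2.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
ols = LinearRegression().fit(X_train, y_train)
print("OLS held-out R^2:", ols.score(X_val, y_val))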
Your loss decreasing from 40000 to 6000 means your NN model has learned the dominant relationships, but not all of them. You can aid this learning by transforming the predictor variables and feeding the derived features to your model to see if that helps. You can also try stepwise addition of features to your NN model, adding the most influential predictors first and evaluating the model performance (i.e., training loss) at every iteration.
If the first step doesn't help, and since you are open to other approaches, Gaussian process regression or quantile regression may help depending on your data's dynamics, as these methods are free from the assumptions that linear regression techniques make. They should also help you explore different aspects of the relationship between your independent and dependent variables.
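Purely as an illustration (not the original poster's code), a Gaussian process regression baseline with scikit-learn might look like the sketch below; X and y stand for the same standardized inputs and target, and the kernel and subsample size are arbitrary choices:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split

# X and y are assumed to be the 10-feature inputs and continuous target from the question.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Exact GPs scale cubically with sample count, so fit on a random subsample first.
idx = np.random.RandomState(0).choice(len(X_train), size=2000, replace=False)
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X_train[idx], y_train[idx])
print("GP held-out R^2:", gpr.score(X_val, y_val))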
To be more clear, let's consider the problem of loan default prediction.
Let's say I have trained and tested multiple classifiers offline, ensembled them, and then put this model into production.
But because people change, the data and many other factors change as well, and the performance of our model will eventually decrease, so it then needs to be replaced with a new, better model.
What are the common techniques, model stability tests, model performance tests, and metrics used after deployment? How do you decide when to replace the current model with a newer one?
It depends on the problem (classification, regression, or clustering). Let's say you have a classification problem and you trained and tested a model with 75% accuracy (or another metric). Once in production, if the accuracy is significantly less than 75%, you can stop the model and see what is happening.
In my case, I record the accuracy of the model in production each day for a week. After that, I compute the mean and variance of the accuracy and apply a one-sample t-test on the mean, to see whether this accuracy diverges significantly from the desired accuracy.
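Purely as an illustration, here is a minimal sketch of that check with scipy; the daily accuracy values and the 0.05 significance level are made-up assumptions:

from scipy import stats

# One production accuracy per day for a week, plus the accuracy measured offline.
daily_accuracy = [0.71, 0.69, 0.73, 0.68, 0.70, 0.72, 0.67]
target_accuracy = 0.75

# One-sample t-test: does the mean production accuracy differ from the offline accuracy?
t_stat, p_value = stats.ttest_1samp(daily_accuracy, target_accuracy)
if p_value < 0.05:
    print("Production accuracy diverges significantly from the offline accuracy.")
else:
    print("No significant divergence detected yet.")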
Hope that helps.
I am working on a classification problem using three different classifiers, namely Decision Tree, Naive Bayes, and IBK. I have two data sets which are the same in layout and attribute names, but the values in each are different.
Training Set Example:
State, Population, HouseholdIncome, FamilyIncome, perCapInc, NumUnderPov, EducationLevel_1, EducationLevel_2, EducationLevel_3, UnemploymentRate, EmployedRate, ViolentCrimesPerPop, CrimeRate
8, 0.19, 0.37, 0.39, 0.4, 0.08, 0.1, 0.18, 0.48, 0.27, 0.68, 0.2, Low
I would like my decision tree to predict, using the 12 attributes, whether the target class value is Low, Med, or High, based on the ViolentCrimesPerPop figure, which in this example is 0.2.
My question is: on my test set, do I just provide more unseen examples in the same format, or should I take away one of the attributes so I can see whether it has learned anything?
It is a good idea to separate your dataset into three separate sets: Training, Testing and Validation.
The training set is used to train each of the models that you are building. This is usually checked for performance using a testing set. As the designer continues to adjust the parameters of their model (for example, pruning options on Decision Trees and k for k-NN or Neural Network parameters), you can see how well the model is performing against the testing set.
Finally, once these parameters have been completed for your model, you can then run these against a validation set to confirm that the model did not over-fit on the testing data (due to parameter adjustments applied to the model itself).
A further discussion of these sets may be found here.
Generally, I have used a 60-20-20 data split; however, it is common to use 50-25-25 as well. It really comes down to how much data you have to play with.
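If it helps, here is a minimal sketch of producing a 60-20-20 split with scikit-learn; X and y are placeholders for your full feature matrix and labels:

from sklearn.model_selection import train_test_split

# First carve off 40% of the data, then split that half-and-half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_valid, y_test, y_valid = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
# Result: 60% training, 20% testing (for parameter tuning), 20% validation (final check),
# matching the split described above.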
I hope this helps!
It is not a good idea to test your classifier on the same data you trained it on, because your model has, hopefully, already learned to classify those instances correctly.
The usual setup is to train on the training dataset and then test on a different dataset (with the same format/structure) to see how it performs.
I am building a system with a NN trained for classification.
I am interested in the error rates of the systems you have built.
A classic example from the UCI ML repository is the Iris data set.
A NN trained on it is almost perfect (0-1% error rate); however, it is a very basic dataset.
My network has the following structure: 80 inputs, 208 hidden units, 2 outputs.
My result is an 8% error rate on the testing dataset.
Basically, in this question I want to ask about the various results you have encountered in your work, in papers, etc.
Addition 1:
The error rate is, of course, on the testing data, not the training data, so it is a completely new dataset for the network.
Addition 2 (from my comment under the question):
My new results: 1200 entries, 900 for training, 300 for testing; 85 in Class1, 1115 in Class2. Of the 85 Class1 entries, 44 are in the testing set. Error rate: 6%. That is not so bad, since 44 is ~15% of 300, so I am 2.5 times better than always predicting Class2.
Model performance is completely problem-specific. Even among situations with similar quality and volumes of development data, with identical target variable definitions, performance can vary substantially. Obviously, the more similar the problem definitions, the more likely the performance of different models are to match.
Another thing to consider is the difference between technical performance and business performance. In some applications, an accuracy of 52% is tremendously profitable, whereas in other areas an accuracy of 98% would be hopelessly low.
Let me also add that, besides what Predictor mentions, measuring your performance on the training set is usually useless as a guide to how your classifier will perform on previously unseen data. Many times, with relatively simple classifiers, you can get a 0% error rate on the training set without learning anything useful (this is called overfitting).
What is more commonly used (and more helpful in determining how well your classifier works) is either held-out data or cross validation; even better if you separate your data into three sets: training, validation, and testing.
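As a rough sketch (with synthetic placeholder data, not your dataset), cross validation in scikit-learn looks like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a placeholder classifier; swap in your own features, labels, and model.
X, y = make_classification(n_samples=1200, n_features=80, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))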
Also, it is very hard to get a sense of how well a classifier works from a single threshold and only true positives + true negatives. People tend to also evaluate false positives and false negatives and plot ROC curves to see/evaluate the trade-off. So, when saying "2.5 times better", you should be clear that you are comparing to a classifier that classifies everything as C2, which is a pretty crappy baseline.
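Purely for illustration with synthetic data, plotting an ROC curve with scikit-learn could look like the following sketch; your own model's predicted scores would replace the placeholder classifier:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data and a placeholder classifier.
X, y = make_classification(n_samples=1200, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)

# True positive rate vs. false positive rate across all thresholds.
plt.plot(fpr, tpr, label="AUC = %.2f" % roc_auc_score(y_te, scores))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()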
See, for example, this paper:
Danilo P. Mandic and Jonathon A. Chambers (2000). Towards the Optimal Learning Rate for Backpropagation. Neural Processing Letters 11: 1–5.