Bizarre results with classification models

I am running a few classification models, such as logistic regression and CatBoost, and I have held out part of the train set as unseen data.
When I score both the train and the unseen data with logistic regression, I get accuracy, AUC, F1, and recall all greater than 0.90. Since it is a class-imbalance problem, I balanced the classes using SMOTE, and I also used z-scores to normalise all variables.
Although the model performs well on the train, unseen, and test data, when I actually run it on the (unlabelled) set of data I want to predict, the model gives me only ten 1s and the remaining 150k rows are 0s.
Could there really be an issue with my model, or is it indeed just the way the data is?
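One thing worth checking (not necessarily the cause here) is whether SMOTE and the z-score scaling were fit before the split, so the 0.90+ metrics are computed partly on synthetic or leaked information. A minimal sketch, assuming numpy arrays X and y and the imbalanced-learn package, of keeping both steps inside cross-validation so they are fit on the training folds only:

# Sketch: keep SMOTE and z-score scaling inside cross-validation,
# so both are fit only on the training folds. X (features) and
# y (0/1 labels) are assumed to exist already.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline understands resamplers

pipe = Pipeline([
    ("scale", StandardScaler()),        # z-score normalisation
    ("smote", SMOTE(random_state=42)),  # resamples training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, scoring="f1", cv=cv)
print("CV F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))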


Model performs badly on test data

I am working on an anomaly detection/classification problem.
I trained a HistGradientBoostingClassifier model in sklearn.
The dataset is imbalanced, so I used the F1 score as the metric to validate model performance.
The model seems to perform well during fitting with GridSearchCV, and it also performed well on the test set.
However, when I tested it on a new dataset, the performance was very bad.
So I have a few questions:
In the first image, you can see that the train loss is much lower than the validation loss. Is this an indication of overfitting?
If it is overfitting, why does the model perform well on the test data (F1 score of about 0.9)?
Why does it perform so badly on new data (F1 score of about 0.06 in the second image)?
What should my next step be to tackle this problem?
I think you should try SMOTE or SMOTETomek on your training data before fitting.
The SMOTE and SMOTETomek algorithms are available in imbalanced-learn.
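A minimal sketch of that suggestion, assuming X_train, y_train, X_test, y_test already exist; the resampling is applied to the training split only:

# Resample only the training split, then fit the same classifier on it.
# X_train, y_train, X_test, y_test are assumed to exist already.
from imblearn.combine import SMOTETomek
from sklearn.ensemble import HistGradientBoostingClassifier  # stable import in scikit-learn >= 1.0
from sklearn.metrics import f1_score

X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
clf = HistGradientBoostingClassifier().fit(X_res, y_res)
print("test F1:", f1_score(y_test, clf.predict(X_test)))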

Using clustering classification as regression feature?

I am attempting to use KMeans clustering to create a feature for an XGBoost regression. The problem is, I am not sure whether there is data leakage. The data has a date column, so right now I am fitting the clusters on the first 70% of the data sorted by date, and using that same 70% as my training set.
The clustering includes my target variable. Using the cluster label as a feature provides a huge boost to test scores, so I worry that this is causing data leakage. However, the cluster labels used for the test scores come from unseen data in the test set.
Is this valid, or is it causing data leakage? Thank you
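For comparison, a leak-free variant of this setup fits KMeans on the training features only (without the target column) and assigns clusters to test rows with the fitted model. A rough sketch, with X_train, X_test, y_train and the cluster count all illustrative:

# Rough sketch of the leak-free variant: cluster on training features
# only (no target column), then label test rows with the fitted model.
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_train)
# Raw cluster ids are used as a numeric feature here for brevity;
# one-hot encoding them is often the safer choice.
X_train_aug = np.column_stack([X_train, km.labels_])
X_test_aug = np.column_stack([X_test, km.predict(X_test)])

model = XGBRegressor(random_state=0).fit(X_train_aug, y_train)
preds = model.predict(X_test_aug)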

How to do regularization in Matlab's NN toolbox

My data set has 150 independent variables (inputs) and 10 response variables (outputs). The problem is to find a mapping between the input and output variables. There are 1000 data points, of which I have used 70% for training and 30% for testing. I am using a feedforward neural network with 10 hidden neurons, as explained in this Matlab document. I am evaluating the performance using the commands
perf_Train = perform(net,TrainedData',lblTrain')  % MSE on the training data
YPred = net(XTest);                               % network predictions on the test inputs
perf_Test = perform(net,YPred,lblTest')           % MSE on the test data
which basically give the mean squared error between the actual and the predicted (estimated) responses for training and testing. The test data does not fit the trained model well, although the training data fits quite well.
Problem 1: My training error is always lower than my test error, i.e., perf_Train = 0.0867 and perf_Test = 0.567.
Is this overfitting or underfitting?
Problem 2: How do I make the model fit the test data accurately? Theory says that to overcome overfitting and underfitting we need regularization. Is there a parameter, such as a regularization term, that can be passed to the function to overcome this?
It is overfitting, since the training error is lower than the test error.
I would recommend setting fewer epochs (iterations) for your training, or using less training data.
I would also recommend checking that the training data and the test data are picked randomly.
For regularization, it can be set like this:
net.performParam.regularization = 0.5;  % weight of the regularization term, between 0 and 1
The best performance ratio depends on the model; 0.5 is just an example.
For more details, you can refer to the documentation below.
https://www.mathworks.com/help/deeplearning/ug/improve-neural-network-generalization-and-avoid-overfitting.html#bss4gz0-38
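The same knob exists outside Matlab. This is not the Matlab toolbox's method, just a sketch of the idea in scikit-learn, whose MLPRegressor exposes an L2 weight penalty through its alpha parameter (data arrays assumed; .score() reports R^2, not MSE):

# Not the Matlab toolbox: a scikit-learn analogue. alpha is an L2 penalty
# on the weights (larger alpha = stronger regularization). X_train, y_train,
# X_test, y_test are assumed to exist.
from sklearn.neural_network import MLPRegressor

for alpha in (1e-4, 1e-2, 1.0):
    net = MLPRegressor(hidden_layer_sizes=(10,), alpha=alpha,
                       max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    print(alpha, net.score(X_train, y_train), net.score(X_test, y_test))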

KNN giving highest accuracy with K=1?

I am using Weka's IBk to perform classification on text (tweets). I convert the training and test data to a vector-space representation, and when I perform the classification on the test data, the best result comes from K=1. The training and testing data are separate from each other. Why does K=1 give the best accuracy?
Because you are using vector representations: at k=1 the prediction is driven entirely by the proximity of the single nearest neighbour, whereas at k=n (e.g., k=5) it depends on the most common class among the n neighbours.
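The question uses Weka's IBk, but the effect is easy to reproduce in any KNN implementation. A sketch in scikit-learn, assuming a list of strings tweets and a list labels, that sweeps k and reports cross-validated accuracy:

# Illustration only (the question itself uses Weka's IBk): sweep k and
# compare cross-validated accuracy. tweets (list of str) and labels are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

for k in (1, 3, 5, 7, 9):
    pipe = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, tweets, labels, cv=5)
    print("k=%d accuracy=%.3f" % (k, scores.mean()))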

How to calculate Training and testing accuracy in image classification using SVM in matlab

I am trying to classify four groups of images using an SVM, randomly selecting the training and testing data each time. When I run the program, the performance varies because of the random selection of data. How do I get an accurate measure of my algorithm's performance, and how do I calculate the training and testing accuracy?
The formula I am using for performance is
Performance = sum(PredictedLabels == test_labels) / numel(PredictedLabels)  % fraction of correct predictions
I am using the multisvm function for classification.
My suggestion:
Actually, the performance measure is acceptable, though there are some other slightly better choices, as @Dan has mentioned.
More importantly, you need to deal with the randomness.
1) Every time you select your training data, test the trained model on multiple randomized test sets and average the accuracy (e.g., 10 times or so), as in the sketch after these remarks.
2) Train multiple models and average their performance to get the general performance.
Remarks:
1) You need to make sure the training data and the test data do not overlap; otherwise it is no longer test data.
2) It is better for the training data to have the same number of samples from each class label. This means you can partition your dataset in advance.
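The question is in Matlab, but the averaging idea itself is tool-agnostic. As an illustration, a scikit-learn sketch that repeats stratified random train/test splits (equal class proportions in every split) and averages the test accuracy; X and y are assumed numpy arrays:

# Illustration of the averaging idea (the question itself uses Matlab):
# repeat stratified random train/test splits and average the test accuracy.
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
accs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print("mean accuracy over 10 splits:", sum(accs) / len(accs))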