Confusion matrix doesn't match evaluation - neural-network

I'm training a CNN with TensorFlow for image classification on the Food-101 dataset, and I reach a test accuracy of about 80% (measured with model.evaluate()).
The issue is that when I plot the confusion matrix for the 3 classes involved, it looks very different: the main diagonal reaches 40% at most.
I could understand it if at least 1 of the 3 classes were around 100%, because I would expect the averaged accuracy to rise even with bad results on the other classes. But in this case none of them is anywhere near what I get during evaluation.
I also tried plotting the confusion matrix on the training data, on which I reached more than 90% accuracy during training, and that one is not correct either.
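A small sanity check may help here. Overall accuracy is the support-weighted average of the per-class values on the diagonal of a row-normalized confusion matrix, so a diagonal that tops out at 40% cannot come from the same (label, prediction) pairs that give 80% accuracy. The sketch below uses made-up labels and hypothetical helper names (nothing here is taken from the question's code) to show the relationship. One thing worth checking, as an assumption since the plotting code is not shown, is whether the labels and predictions passed to the matrix are in the same order.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def overall_accuracy(cm):
    """Fraction of samples on the main diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def per_class_recall(cm):
    """Diagonal of the row-normalized matrix."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Made-up labels and predictions for 3 classes (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, 3)
acc = overall_accuracy(cm)
recalls = per_class_recall(cm)
supports = [sum(row) for row in cm]

# Support-weighted mean of the per-class recalls equals the overall accuracy.
weighted = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)
print(cm, acc, recalls, weighted)
```

If the accuracy computed this way from the plotted matrix disagrees with model.evaluate(), the two are not looking at the same pairs.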

Related

Dimensionality reduction, normalization, resampling, k-fold CV... In what order?

I am working in Python on a binary classification problem: fraud detection for travel insurance. Here are the characteristics of my dataset:
Contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
Majority unlabeled: out of 40,000 samples, 33,000 samples are unlabeled.
Highly imbalanced: out of 7,000 labeled samples, only 800 samples (11%) are positive (fraud).
The metrics are precision, recall and F2 score. We focus more on avoiding false negatives, so high recall is valued. As preprocessing I oversampled the positive cases using SMOTE-NC, which takes categorical variables into account as well.
After trying several approaches, including semi-supervised learning with self-training and label propagation/label spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on test). I understand that precision is good on training because the data is resampled, and low on test because it directly reflects the class imbalance in the test data. But this precision score is unacceptably low, so I want to fix it.
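The dependence of precision on class balance can be made concrete. For a classifier with a fixed true positive rate and false positive rate, precision follows from the positive-class prevalence alone. The sketch below uses illustrative rates (0.7 and 0.3 are assumptions, not numbers measured on this dataset) and a hypothetical helper name; it only shows how the same classifier looks on balanced versus 11%-positive data.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision of a classifier with the given rates at a given positive rate."""
    true_pos = tpr * prevalence          # expected fraction of true positives
    false_pos = fpr * (1 - prevalence)   # expected fraction of false positives
    return true_pos / (true_pos + false_pos)

# Illustrative rates only (assumed, not taken from the question).
tpr, fpr = 0.7, 0.3

balanced = precision_at_prevalence(tpr, fpr, 0.5)     # after resampling to 50/50
imbalanced = precision_at_prevalence(tpr, fpr, 0.11)  # original 11% positives
print(balanced, imbalanced)
```

With these assumed rates, precision drops from 0.7 on balanced data to roughly 0.22 at 11% prevalence without the classifier changing at all, so part of the gap between training and test precision can come from prevalence rather than overfitting.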
So, to simplify the model, I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD (Factor Analysis of Mixed Data).
Question 1: In what order should I do normalization, FAMD, k-fold cross-validation and resampling? Is my approach below correct?
Question 2: The package prince does not have methods such as fit or transform like in sklearn, so I cannot do the 3rd step described below. Are there any other good packages that offer fit and transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach:
1. Make k folds and isolate one of them for validation, use the rest for training
2. Normalize training data and transform validation data
3. Fit FAMD on training data, and transform training and test data
4. Resample only training data using SMOTE-NC
5. Train whatever model it is, evaluate on validation data
6. Repeat 2-5 k times and take the average of precision, recall and F2 score
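The ordering in the steps above can be sketched as follows. This is a minimal illustration under several assumptions, not a drop-in solution: the data is synthetic, plain random oversampling stands in for SMOTE-NC, a nearest-centroid rule stands in for "whatever model", and the FAMD step is only marked by a comment. All helper names are hypothetical. The point is that every statistic (mean, standard deviation, resampling) is computed on the training folds only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 4 numeric features, about 11% positives.
n = 1000
y = (rng.random(n) < 0.11).astype(int)
X = rng.normal(size=(n, 4)) + 1.5 * y[:, None]

def precision_recall_fbeta(y_true, y_pred, beta=2.0):
    """Precision, recall and F-beta (F2 by default) for the positive class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

def oversample(X, y, rng):
    """Random oversampling of the minority class (stand-in for SMOTE-NC)."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

def nearest_centroid_predict(X_train, y_train, X_val):
    """Toy model: assign each point to the class with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_val - c0, axis=1)
    d1 = np.linalg.norm(X_val - c1, axis=1)
    return (d1 < d0).astype(int)

k = 5
folds = np.array_split(rng.permutation(n), k)            # step 1
scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_va, y_va = X[val_idx], y[val_idx]

    mean, std = X_tr.mean(axis=0), X_tr.std(axis=0)      # step 2: fit on train only
    X_tr, X_va = (X_tr - mean) / std, (X_va - mean) / std

    # step 3: a dimensionality reduction would be fit on X_tr here (omitted)

    X_tr, y_tr = oversample(X_tr, y_tr, rng)             # step 4: train only

    y_hat = nearest_centroid_predict(X_tr, y_tr, X_va)   # step 5
    scores.append(precision_recall_fbeta(y_va, y_hat))

mean_precision, mean_recall, mean_f2 = np.mean(scores, axis=0)  # step 6
print(mean_precision, mean_recall, mean_f2)
```

The validation fold keeps its original class balance, so the averaged precision reflects what can be expected on imbalanced test data.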
*I would also appreciate any advice on my overall approach to this problem.
Thanks!

Poor performance of SVM on unbalanced dataset - how to improve?

Consider a dataset A which has examples for training in a binary classification problem. I have used SVM and applied the weighted method (in MATLAB) since the dataset is highly imbalanced. I have applied weights as inversely proportional to the frequency of data in each class. This is done on training using the command
fitcsvm(trainA, trainTarg , ...
'KernelFunction', 'RBF', 'KernelScale', 'auto', ...
'BoxConstraint', C,'Weight',weightTrain );
I have used 10-fold cross-validation for training and tuned the hyperparameter as well. So, inside CV, the dataset A is split into training (trainA) and validation (valA) sets. After training is over, and outside the CV loop, I get the confusion matrix on A:
80025 1
0 140
where the first row is for the majority class and the second row is for the minority class. There is only 1 false positive (FP) and all minority class examples have been correctly classified giving true positive (TP) = 140.
PROBLEM: Then I run the trained model on a new test data set B, which was never seen during training. This is the confusion matrix for testing on B:
50075 0
100 0
As can be seen, none of the minority class examples has been classified correctly, so the weights have failed their purpose. Although there are no FPs, the SVM fails to capture any minority class examples.
I have not applied any weights or balancing method such as sampling (SMOTE, RUSBoost etc) on B. What could be wrong and how to overcome this problem?
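To put numbers on the two matrices, the sketch below computes accuracy, precision and recall from them. It is written in Python rather than MATLAB purely for illustration, and the helper name is hypothetical. Rows are true classes (majority first), columns are predicted classes, as described in the question. Note that the matrix on A is computed on data the model was trained on, so it is an optimistic estimate.

```python
def binary_metrics(cm):
    """Accuracy, precision and recall for the minority (second) class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # reported as 0.0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

cm_A = [[80025, 1], [0, 140]]    # confusion matrix on A (training data)
cm_B = [[50075, 0], [100, 0]]    # confusion matrix on B (unseen data)

acc_A, prec_A, rec_A = binary_metrics(cm_A)
acc_B, prec_B, rec_B = binary_metrics(cm_B)
print(acc_A, prec_A, rec_A)
print(acc_B, prec_B, rec_B)
```

On B the accuracy is still above 99% while the recall for the minority class is zero, which is why accuracy alone says little on data this imbalanced.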
Class misclassification weights could be set instead of sample weights!
You can set the class weights based on the following example.
The misclassification weight for class A (n records; dominant class) into class B (m records; minority class) can be n/m.
The misclassification weight for class B into class A can be set to 1 or m/n, based on the severity you want to impose on the learning.
c=[0 2.2;1 0];
mod=fitcsvm(X,Y,'Cost',c)
According to the documentation:
For two-class learning, if you specify a cost matrix, then the
software updates the prior probabilities by incorporating the
penalties described in the cost matrix. Consequently, the cost matrix
resets to the default. For more details on the relationships and
algorithmic behavior of BoxConstraint, Cost, Prior, Standardize, and
Weights, see Algorithms.
Area Under the Curve (AUC) is usually used to measure the performance of models that are applied to unbalanced data. It is also good to plot the ROC curve to get more insight visually. Using only the confusion matrix for such models may lead to misinterpretation.
perfcurve from the Statistics and Machine Learning Toolbox provides both functionalities.
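For intuition about what AUC measures, here is a small Python sketch (not MATLAB, and not how perfcurve is implemented). It uses the rank interpretation: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores are made up, and the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up imbalanced example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.8, 0.6, 0.9]

auc = roc_auc(y_true, scores)
print(auc)
```

Because it only depends on how positives are ranked against negatives, the value does not change with the decision threshold or with the class proportions.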

sklearn DecisionTreeClassifier more depth less accuracy?

I have two trained sklearn.tree.tree.DecisionTreeClassifiers. Both are trained with the same training data, but with different maximum depths. The depth for the decision_tree_model was 6 and the depth for the small_model was 2. Besides max_depth, no other parameters were specified.
When I compute the accuracy of both on the training data like this:
small_model_accuracy = small_model.score(training_data_sparse_matrix, training_data_labels)
decision_tree_model_accuracy = decision_tree_model.score(training_data_sparse_matrix, training_data_labels)
Surprisingly the output is:
small_model accuracy: 0.61170212766
decision_tree_model accuracy: 0.422496238986
How is this even possible? Shouldn't a tree with a higher maximum depth always have a higher accuracy on the training data when learned with the same training data? Could it be that the score function outputs 1 - accuracy or something?
EDIT:
I just tested it with even higher maximum depth. The value returned becomes even lower. This hints at it being 1 - accuracy or something like that.
EDIT#2:
It seems to be a mistake I made in handling the training data. I thought about the whole thing again and concluded: "Well, if the depth is higher, the tree shouldn't be the reason for this. What else is there? The training data itself. But I used the same data! Maybe I did something to the training data in between?"
Then I checked again and there is a difference in how I use the training data. I need to transform it from an SFrame into a scipy matrix (might have to be sparse too). Now I made another accuracy calculation right after fitting the two models. This one results in 61% accuracy for the small_model and 64% accuracy for the decision_tree_model. That's only 3% more and still somewhat surprising, but at least it's possible.
EDIT#3:
The problem is resolved. I handled the training data incorrectly, and that resulted in different fits.
Here is the plot of accuracy after fixing the mistakes:
This looks correct and would also explain why the assignment creators chose 6 as the maximum depth.
Shouldn't a tree with a higher maximum depth always have a higher
accuracy when learned with the same training data?
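No, definitely not always. By fitting a more complex tree you risk overfitting the model to your training data, which can lower the score on unseen data as you increase the maximum depth. On the training data itself, though, a deeper tree should normally score at least as high as a shallower one, which is why the numbers in the question pointed to a data-handling problem.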
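The difference between training and held-out accuracy can be checked with a quick experiment. The sketch below is an illustration on synthetic data generated with make_classification (an assumption; it has nothing to do with the SFrame data in the question) and compares trees of depth 2, depth 6 and unlimited depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (assumption): 20% of the labels are flipped at random.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (2, 6, None):   # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(depth, train_acc, test_acc)
```

The fully grown tree fits the training set perfectly because it memorizes the flipped labels too; whether its test accuracy ends up below that of the shallower trees depends on the data, so it is worth looking at both columns.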

Evaluating performance of Neural Network embeddings in kNN classifier

I am solving a classification problem. I train an unsupervised neural network on a set of entities (using the skip-gram architecture).
To evaluate, I search the training data for the k nearest neighbours of each point in the validation data. I take a weighted sum (weights based on distance) of the labels of the nearest neighbours and use that as the score of each validation point.
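The evaluation just described could look roughly like the sketch below. Several details are assumptions, because the question does not specify them: Euclidean distance, inverse-distance weights normalized to sum to 1, and binary 0/1 labels. The function name and the toy embeddings are hypothetical.

```python
import numpy as np

def knn_scores(train_emb, train_labels, val_emb, k, eps=1e-12):
    """Distance-weighted average of the labels of the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_val, n_train).
    dists = np.linalg.norm(val_emb[:, None, :] - train_emb[None, :, :], axis=2)
    nn_idx = np.argsort(dists, axis=1)[:, :k]              # indices of k nearest
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)    # their distances
    weights = 1.0 / (nn_dist + eps)                        # closer = heavier (assumed)
    weights /= weights.sum(axis=1, keepdims=True)          # normalize per point
    return (weights * train_labels[nn_idx]).sum(axis=1)

# Toy embeddings (made up): two positives near the origin, one negative far away.
train_emb = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
train_labels = np.array([1, 1, 0])
val_emb = np.array([[0.1, 0.0], [10.0, 9.0]])

scores_k1 = knn_scores(train_emb, train_labels, val_emb, k=1)
scores_k2 = knn_scores(train_emb, train_labels, val_emb, k=2)
print(scores_k1, scores_k2)
```

These scores can then be fed to an AUC computation for each value of k.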
Observation - As I increase the number of epochs (model 1 - 600 epochs, model 2 - 1400 epochs and model 3 - 2000 epochs), my AUC improves at smaller values of k but saturates at similar values.
What could be a possible explanation of this behaviour?
[Reposted from CrossValidated]
To cross-check whether imbalanced classes are an issue, try fitting an SVM model. If that gives a better classification (possible if your ANN is not very deep), it may be concluded that the classes should be balanced first.
Also, try some kernel functions to check whether this transformation makes the data linearly separable.

Understanding Matlab Pattern Recognition Neural Network Plots

I am doing a project on vehicle classification and it is almost finished, but I am confused about several of the plots I get from my neural network.
I used 230 images [90 = Hatchbacks, 90 = Sedans, 50 = SUVs] for classification on 80 feature points.
Thus my vInput was an [80x230] matrix and my vTarget was a [3x230] matrix.
The classifier works well, but I don't understand these plots or whether they are abnormal.
My neural Network
Then I clicked these 4 plots in the PLOT section and got these sequentially.
Performance Plot
Training State
Confusion Plot
Receiver Operating Characteristic Plot
I know that is a lot of images, but I know nothing about them.
In the MATLAB documentation they just train the system and plot the graphs.
So could someone please briefly explain them to me or point me to some good links to learn about them?
The first two plots show training statistics.
The Performance Plot shows the mean squared error dynamics for all your datasets on a logarithmic scale. Training MSE is always decreasing, so it's the validation and test MSE you should be interested in. Your plot shows a perfect training.
Training State shows some other training statistics.
Gradient is the value of the backpropagation gradient at each iteration on a logarithmic scale. 5e-7 means that you have reached the bottom of a local minimum of your goal function.
Validation fails are iterations in which the validation MSE increased. A lot of fails means overtraining, but in your case it's OK. MATLAB automatically stops training after 6 fails in a row.
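The stopping rule described above can be sketched in a few lines. This is a Python illustration of the rule as stated in this answer (count consecutive iterations in which the validation MSE went up, stop at 6), not MATLAB's actual implementation, whose bookkeeping may differ. The function name, its parameter and the MSE values are hypothetical.

```python
def stop_iteration(val_mse, max_fail=6):
    """Return the iteration index at which training would stop, or None."""
    fails = 0
    for i in range(1, len(val_mse)):
        if val_mse[i] > val_mse[i - 1]:
            fails += 1          # validation error went up: one more fail
        else:
            fails = 0           # any improvement resets the streak
        if fails >= max_fail:
            return i
    return None

# Made-up validation curves.
overtrained = [1.0, 0.8, 0.6, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57]
healthy = [1.0, 0.9, 0.95, 0.8, 0.85, 0.7]

stop_a = stop_iteration(overtrained)
stop_b = stop_iteration(healthy)
print(stop_a, stop_b)
```

Isolated increases, as in the second curve, reset the counter and do not trigger a stop.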
The other two plots show the results of your network simulation after training.
Confusion Plot. In your case it's 100% accurate. Green cells represent correct answers and red cells represent all types of incorrect answers.
For example, you may read the first one (training set) as: "59 samples from class 1 were correctly classified as class 1, 13 samples from class 2 were correctly classified as class 2 and 6 samples from class 3 were correctly classified as class 3".
The Receiver Operating Characteristic Plot shows the same thing, but in a different way - using the ROC curve: